Abstract

Finding a fixed point of a nonexpansive operator, i.e., $x^* = Tx^*$, abstracts many problems in
numerical linear algebra, optimization, and other areas of scientific computing. To solve fixed-point problems,
we propose ARock, an algorithmic framework in which multiple agents (machines, processors, or cores)
update $x$ in an asynchronous parallel fashion. Asynchrony is crucial to parallel computing since it reduces
synchronization wait, relaxes communication bottlenecks, and thus speeds up computing significantly. At each
step of ARock, an agent updates a randomly selected coordinate $x_i$ based on possibly out-of-date information
on $x$. The agents share $x$ through either global memory or communication. If writing $x_i$ is atomic, the agents
can read and write $x$ without memory locks.

Theoretically, we show that if the nonexpansive operator $T$ has a fixed point, then with probability one,
ARock generates a sequence that converges to a fixed point of $T$. Our conditions on $T$ and step sizes are
weaker than those in comparable work. Linear convergence is also obtained.

We propose special cases of ARock for linear systems, convex optimization, machine learning, as well as 
distributed and decentralized consensus problems. Numerical experiments of solving sparse logistic regression 
problems are presented. 
1 Introduction

Technological advances in data gathering and storage have led to a rapid proliferation of big data in diverse 
areas such as climate studies, cosmology, medicine, the Internet, and engineering [29]. The data involved in 
many of these modern applications are large and grow quickly. Therefore, parallel computational approaches 
are needed. This paper introduces a new approach to asynchronous parallel computing with convergence 
guarantees. 

In a synchronous (sync) parallel iterative algorithm, the agents must wait for the slowest agent to finish
an iteration before they can all proceed to the next one (Figure 1a). Hence, the slowest agent may cripple
the system. In contrast, the agents in an asynchronous (async) parallel iterative algorithm run continuously
with little idling (Figure 1b). However, the iterations are disordered, and an agent may carry out an iteration
without the newest information from other agents.
Fig. 1: Sync-parallel computing (a, left) versus async-parallel computing (b, right).


Asynchrony has other advantages [9] : the system is more tolerant to computing faults and communication 
glitches; it is also easy to incorporate new agents. 

On the other hand, it is more difficult to analyze asynchronous algorithms and ensure their convergence. 
It becomes impossible to find a sequence of iterates in which each one completely determines the next. Nonetheless, we
let any update be a new iteration and propose an async-parallel algorithm (ARock) for the generic fixed-point 
iteration. It converges if the fixed-point operator is nonexpansive (Def. 1) and has a fixed point. 

Let $\mathcal{H}_1, \dots, \mathcal{H}_m$ be Hilbert spaces and $\mathcal{H} := \mathcal{H}_1 \times \cdots \times \mathcal{H}_m$ be their Cartesian product. For a nonexpansive
operator $T : \mathcal{H} \to \mathcal{H}$, our problem is to

$$\text{find } x^* \in \mathcal{H} \text{ such that } x^* = Tx^*. \qquad (1)$$

Finding a fixed point of $T$ is equivalent to finding a zero of $S = I - T$, that is, a point $x^*$ such that $0 = Sx^*$.
Hereafter, we will use both $S$ and $T$ for convenience.

Problem (1) is widely applicable in linear and nonlinear equations, statistical regression, machine learning, 
convex optimization, and optimal control. A generic framework for problem (1) is the Krasnosel’skii-Mann 
(KM) iteration [32]: 


$$x^{k+1} = x^k + \alpha (Tx^k - x^k), \quad \text{or equivalently,} \quad x^{k+1} = x^k - \alpha S x^k, \qquad (2)$$

where $\alpha \in (0,1)$ is the step size. If $\mathrm{Fix}\,T$, the set of fixed points of $T$ (zeros of $S$), is nonempty,
then the sequence $(x^k)_{k \ge 0}$ converges weakly to a point in $\mathrm{Fix}\,T$ and $(Tx^k - x^k)_{k \ge 0}$ converges strongly to
0. The KM iteration generalizes algorithms in convex optimization, linear algebra, differential equations,
and monotone inclusions. Its special cases include the following iterations: alternating projection, gradient
descent, projected gradient descent, proximal-point algorithm, Forward-Backward Splitting (FBS) [43],
Douglas-Rachford Splitting (DRS) [36], a three-operator splitting [19], and the Alternating Direction Method
of Multipliers (ADMM) [36,27].
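
As a minimal illustration of iteration (2), the following sketch applies the KM update to a simple nonexpansive operator on the plane, a rotation by 90 degrees, whose only fixed point is the origin. The operator, the step size $\alpha = 0.5$, and all function names are assumptions chosen for this example; they do not come from the paper.

```python
import math

def rotate90(x):
    """Nonexpansive operator T: rotation of the plane by 90 degrees."""
    return [-x[1], x[0]]

def km_iteration(T, x0, alpha=0.5, num_iters=100):
    """Run x^{k+1} = x^k + alpha * (T x^k - x^k) and return the final iterate."""
    x = list(x0)
    for _ in range(num_iters):
        Tx = T(x)
        x = [xi + alpha * (ti - xi) for xi, ti in zip(x, Tx)]
    return x

def norm(x):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(v * v for v in x))

# KM iteration: the iterates shrink toward the fixed point at the origin.
x_km = km_iteration(rotate90, [1.0, 0.0])

# Plain iteration x^{k+1} = T x^k: the iterates only circle the fixed point.
x_plain = [1.0, 0.0]
for _ in range(100):
    x_plain = rotate90(x_plain)
```

The rotation is an isometry, so the unaveraged iteration never approaches the origin, while the averaged step contracts the distance at every iteration. This is the reason the KM iteration uses $\alpha \in (0,1)$ rather than $\alpha = 1$.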

In ARock, a set of $p$ agents, $p \ge 1$, solve problem (1) by updating the coordinates $x_i \in \mathcal{H}_i$, $i = 1, \dots, m$,
in a random and asynchronous fashion. Algorithm 1 describes the framework. Its special forms for several
applications are given in Section 2 below.

Algorithm 1: ARock: a framework for async-parallel coordinate updates

Input: $x^0 \in \mathcal{H}$, $K > 0$, a distribution $(p_1, \dots, p_m) > 0$ with $\sum_{i=1}^m p_i = 1$;
global iteration counter $k \leftarrow 0$;
while $k < K$, every agent asynchronously and continuously do
    select $i_k \in \{1, \dots, m\}$ with $\mathrm{Prob}(i_k = i) = p_i$;
    perform an update to $x_{i_k}$ according to (3);
    update the global counter $k \leftarrow k + 1$;


Whenever an agent updates a coordinate, the global iteration counter $k$ increases by one. The $k$th update
is applied to $x_{i_k} \in \mathcal{H}_{i_k}$, where $i_k \in \{1, \dots, m\}$ is an independent random variable. Each coordinate update
has the form:

$$x^{k+1} = x^k - \frac{\eta_k}{m p_{i_k}} S_{i_k} \hat{x}^k, \qquad (3)$$

where $\eta_k > 0$ is a scalar whose range will be set later, $S_{i_k} x := (0, \dots, 0, (Sx)_{i_k}, 0, \dots, 0)$, and $m p_{i_k}$ is used to
normalize nonuniform selection probabilities. In the uniform case, namely, $p_i \equiv \frac{1}{m}$ for all $i$, we have $m p_{i_k} \equiv 1$,
which simplifies the update (3) to

$$x^{k+1} = x^k - \eta_k S_{i_k} \hat{x}^k. \qquad (4)$$


Here, the point $\hat{x}^k$ is what an agent reads from global memory to its local cache and to which $S_{i_k}$ is applied,
and $x^k$ denotes the state of $x$ in global memory just before the update (3) is applied. In a sync-parallel
algorithm, we have $\hat{x}^k = x^k$, but in ARock, due to possible updates to $x$ by other agents, $\hat{x}^k$ can be different
from $x^k$. This is a key difference between sync-parallel and async-parallel algorithms. In Subsection 1.2 below,
we will establish the relationship between $\hat{x}^k$ and $x^k$ as

$$\hat{x}^k = x^k + \sum_{d \in J(k)} (x^d - x^{d+1}), \qquad (5)$$

where $J(k) \subseteq \{k-1, \dots, k-\tau\}$ and $\tau \in \mathbb{Z}^+$ is the maximum number of other updates to $x$ during the
computation of (3). Equation (5) has appeared in [37].
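
To make the roles of $\hat{x}^k$, $x^k$, and the delay bound $\tau$ concrete, here is a serial simulation of update (4). It is only a sketch under stated assumptions: the agents are not real threads, each coordinate of $\hat{x}^k$ is drawn from a state at most `tau` updates old, and the test operator `T_affine`, the matrix `M`, the vector `c`, and the conservatively small step size 0.2 are all hypothetical choices made for illustration.

```python
import random

def arock_simulated(T, x0, eta, tau, num_iters, seed=0):
    """Serially simulate update (4) with per-coordinate stale reads.

    At step k, coordinate j of the read vector x_hat is taken from a state
    that is at most `tau` updates old, so x_hat may be an inconsistent read.
    """
    rng = random.Random(seed)
    m = len(x0)
    history = [list(x0)]                 # most recent states, newest last
    for _ in range(num_iters):
        x = history[-1]
        # Build x_hat coordinate by coordinate from possibly different states.
        x_hat = []
        for j in range(m):
            delay = rng.randint(0, min(tau, len(history) - 1))
            x_hat.append(history[-1 - delay][j])
        i = rng.randrange(m)             # uniform coordinate selection
        s_i = x_hat[i] - T(x_hat)[i]     # (S x_hat)_i with S = I - T
        x_new = list(x)
        x_new[i] = x[i] - eta * s_i
        history.append(x_new)
        if len(history) > tau + 1:
            history.pop(0)
    return history[-1]

# Hypothetical test operator T(x) = M x + c; the row sums of |M| are 0.4.
M = [[0.0, 0.3, 0.1], [0.2, 0.0, 0.2], [0.1, 0.3, 0.0]]
c = [1.0, 2.0, 3.0]

def T_affine(x):
    return [sum(M[r][j] * x[j] for j in range(3)) + c[r] for r in range(3)]

x_final = arock_simulated(T_affine, [0.0, 0.0, 0.0], eta=0.2, tau=2, num_iters=5000)
residual = max(abs(t - v) for t, v in zip(T_affine(x_final), x_final))
```

The fixed-point residual $\|T x - x\|_\infty$ is used as the stopping diagnostic because it does not require knowing $x^*$ in advance.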

The update (3) is only computationally worthwhile if $S_i x$ is much cheaper to compute than $Sx$. Otherwise,
it is preferable to apply the full KM update (2). In Section 2, we will present several applications that
have favorable structures for ARock. The recent work [44] studies coordinate-friendly structures more
thoroughly.

The convergence of ARock (Algorithm 1) is stated in Theorems 3 and 4. Here we include a shortened 
version, leaving detailed bounds to the full theorems: 

Theorem 1 (Global and linear convergence) Let $T : \mathcal{H} \to \mathcal{H}$ be a nonexpansive operator that has a
fixed point. Let $(x^k)_{k \ge 0}$ be the sequence generated by Algorithm 1 with properly bounded step sizes $\eta_k$. Then,
with probability one, $(x^k)_{k \ge 0}$ converges weakly to a fixed point of $T$. This convergence becomes strong if $\mathcal{H}$
has a finite dimension.

In addition, if $T$ is demicompact (see Definition 2 below), then with probability one, $(x^k)_{k \ge 0}$ converges
strongly to a fixed point of $T$.

Furthermore, if $S = I - T$ is quasi-strongly monotone (see Definition 2 below), then $T$ has a unique
fixed point $x^*$, $(x^k)_{k \ge 0}$ converges strongly to $x^*$ with probability one, and $\mathbb{E}\|x^k - x^*\|^2$ converges to 0 at a
linear rate.

In the theorem, the weak convergence result only requires $T$ to be nonexpansive and to have a fixed point.
In addition, the computation requires: (a) bounded step sizes; (b) random coordinate selection; and (c) a
finite maximal delay $\tau$. Assumption (a) is standard, and we will see the bound can be $O(1)$. Assumption
(b) is essential to both the analysis and the numerical performance of our algorithms. Assumption (c) is not
essential; an infinite delay with a light tail is allowed (but we leave it to future work). The strong convergence
result applies to all the examples in Section 2, and the linear convergence result applies to Examples 2.2 and
2.4 when the corresponding operator $S$ is quasi-strongly monotone. Step sizes $\eta_k$ are discussed in Remarks
2 and 4.


1.1 On random coordinate selection 

ARock employs random coordinate selection. This subsection discusses its advantages and disadvantages. 

Its main disadvantage is that an agent cannot cache the data associated with a coordinate. The variable
x and its related data must be either stored in global memory or passed through communication. A secondary 
disadvantage is that pseudo-random number generation takes time, which becomes relatively significant if 
each coordinate update is cheap. (The network optimization examples in Subsections 2.3 and 2.6.2 are 
exceptions, where data are naturally stored in a distributed fashion and random coordinate assignments are 
the results of Poisson processes.) 

There are several advantages of random coordinate selection. It realizes the user-specified update frequency
$p_i$ for every component $x_i$, $i = 1, \dots, m$, even when different agents have different computing powers
and different coordinate updates cost different amounts of computation. Therefore, random assignment ensures
load balance. The algorithm is also fault tolerant in the sense that if one or more agents fail, it will
still converge to a fixed point of $T$. In addition, it has been observed numerically on certain problems [12]
that random coordinate selection accelerates convergence.


1.2 Uncoordinated memory access 

In ARock, since multiple agents simultaneously read and update $x$ in global memory, $\hat{x}^k$, the result of $x$
that is read from global memory by an agent to its local cache for computation, may not equal $x^j$ for any
$j \le k$, that is, $\hat{x}^k$ may never be consistent with a state of $x$ in global memory. This is known as inconsistent
read. In contrast, consistent read means that $\hat{x}^k = x^j$ for some $j \le k$, i.e., $\hat{x}^k$ is consistent with a state of $x$
that existed in global memory.

We illustrate inconsistent read and consistent read in the following example, which is depicted in Figure
2. Consider $x = [x_1, x_2, x_3, x_4]^T \in \mathbb{R}^4$ and $x^0 = [0,0,0,0]^T$ initially, at time $t_0$. Suppose at time $t_1$, agent
2 updates $x_1$ from 0 to 1, yielding $x^1 = [1,0,0,0]^T$; then, at time $t_2$, agent 3 updates $x_4$ from 0 to 2,
further yielding $x^2 = [1,0,0,2]^T$. Suppose that agent 1 starts reading $x$ from the first component $x_1$ at $t_0$.
For consistent read (Figure 2a), agent 1 acquires a memory lock and only releases the lock after finishing
reading all of $x_1$, $x_2$, $x_3$, and $x_4$. Therefore, agent 1 will read in $[0,0,0,0]^T$. Inconsistent read, however,
allows agent 1 to proceed without a memory lock: agent 1 starts reading $x_1$ at $t_0$ (Figure 2b) and reaches
the last component, $x_4$, after $t_2$; since $x_4$ is updated by agent 3 before it is read by agent 1, agent 1 has
read $[0,0,0,2]^T$, which is different from any of $x^0$, $x^1$, and $x^2$.
Fig. 2: Consistent read versus inconsistent read: A demonstration. (a) Consistent read: while agent 1 reads $x$ in memory, it acquires a global lock. (b) Inconsistent read: no lock; agent 1 reads $(0, 0, 0, 2)^T$, a non-existing state of $x$.


Even with inconsistent read, each component is consistent under the atomic coordinate update assumption,
which will be defined below. Therefore, we can express what has been read in terms of the changes of
individual coordinates. In the above example, the first change is $x_1^1 - x_1^0 = 1$, which is added to $x_1$ just before
time $t_1$ by agent 2, and the second change is $x_4^2 - x_4^1 = 2$, added to $x_4$ just before time $t_2$ by agent 3. The
inconsistent read by agent 1, which gives the result $[0,0,0,2]^T$, equals $x^0 + 0 \times (x^1 - x^0) + 1 \times (x^2 - x^1)$.
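
The arithmetic of this example can be replayed in a few lines. The helper `combine` and the variable names are hypothetical; the states $x^0$, $x^1$, $x^2$ are the ones from the example above.

```python
x0 = [0, 0, 0, 0]
x1 = [1, 0, 0, 0]   # agent 2 sets x_1 to 1 at time t_1
x2 = [1, 0, 0, 2]   # agent 3 sets x_4 to 2 at time t_2

def combine(base, weights, states):
    """Return base + sum_d weights[d] * (states[d+1] - states[d])."""
    out = list(base)
    for w, (prev, nxt) in zip(weights, zip(states, states[1:])):
        out = [o + w * (n - p) for o, p, n in zip(out, prev, nxt)]
    return out

states = [x0, x1, x2]
# Lock-free read: misses the change to x_1 but sees the change to x_4.
inconsistent = combine(x0, [0, 1], states)
# Locked read taken at t_0: sees neither change.
consistent = combine(x0, [0, 0], states)
```

The lock-free result is a mix of two states and matches none of $x^0$, $x^1$, $x^2$, while every single coordinate of it did exist in memory at some time.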

We have demonstrated that $\hat{x}^k$ can be inconsistent, but each of its coordinates is consistent, that is, for
each $i$, $\hat{x}_i^k$ is an ever-existed state of $x_i$ among $x_i^k, \dots, x_i^{k-\tau}$. Suppose that $\hat{x}_i^k = x_i^d$, where $d \in \{k, k-1, \dots, k-\tau\}$.
Therefore, $\hat{x}_i^k$ can be related to $x_i^k$ through the interim changes applied to $x_i$. Let $J_i(k) \subseteq \{k-1, \dots, k-\tau\}$
be the index set of these interim changes. If $J_i(k) \ne \emptyset$, then $d = \min J_i(k)$; otherwise, $d = k$. In addition,
we have $\hat{x}_i^k = x_i^d = x_i^k + \sum_{d \in J_i(k)} (x_i^d - x_i^{d+1})$. Since the global counter $k$ is increased after each coordinate
update, updates to $x_i$ and $x_j$, $i \ne j$, must occur at different $k$'s and thus $J_i(k) \cap J_j(k) = \emptyset$, $\forall i \ne j$. Therefore,
by letting $J(k) := \cup_i J_i(k) \subseteq \{k-1, \dots, k-\tau\}$ and noticing $(x_i^d - x_i^{d+1}) = 0$ for $d \in J_j(k)$ where $i \ne j$,
we have $\hat{x}_i^k = x_i^k + \sum_{d \in J(k)} (x_i^d - x_i^{d+1})$, $\forall i = 1, \dots, m$, which is equivalent to (5). Here, we have made two
assumptions:

— atomic coordinate update: a coordinate is not further broken to smaller components during an update; 
they are all updated at once. 

— bounded maximal delay $\tau$: during any update cycle of an agent, $x$ in global memory is updated at most
$\tau$ times by other agents.

When each coordinate is a single scalar, updating the scalar is a single atomic instruction on most modern 
hardware, so the first assumption naturally holds, and our algorithm is lock-free. The case where a coordinate 
is a block that includes multiple scalars is discussed in the next subsection. 
1.2.1 Block coordinate

In the “block coordinate” case (updating a block of several coordinates each time), the atomic coordinate
update assumption can be met by either employing a per-coordinate memory lock or taking the following
dual-memory approach: Store two copies of each coordinate $x_i \in \mathcal{H}_i$ in global memory, denoting them as
$x_i^{(0)}$ and $x_i^{(1)}$; let a bit $\alpha_i \in \{0, 1\}$ point to the active copy; an agent will only read $x_i$ from the active copy
$x_i^{(\alpha_i)}$; before an agent updates the components of $x_i$, it obtains a memory lock to the inactive copy $x_i^{(1-\alpha_i)}$
to prevent other agents from simultaneously updating it; then, after it finishes updating $x_i^{(1-\alpha_i)}$, it flips the bit
$\alpha_i$ so that other agents will begin reading from the updated copy. This approach never blocks any read of
$x_i$, yet it eliminates inconsistency.
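
The dual-memory approach can be sketched as follows. This is a minimal illustration, assuming Python threads and a hypothetical class name `DualCopyBlock`. It mirrors the description above, with a lock that serializes writers only; it does not model hardware memory ordering, and it does not handle the case where two updates complete while a single read is still in progress.

```python
import threading

class DualCopyBlock:
    """One block coordinate x_i stored twice, with a bit selecting the active copy."""

    def __init__(self, values):
        self._copies = [list(values), list(values)]
        self._active = 0                      # plays the role of the bit alpha_i
        self._write_lock = threading.Lock()   # guards the inactive copy only

    def read(self):
        """Readers never take the lock; they copy the currently active block."""
        return list(self._copies[self._active])

    def update(self, new_values):
        """Write the inactive copy under the lock, then flip the bit."""
        with self._write_lock:
            inactive = 1 - self._active
            self._copies[inactive][:] = new_values
            self._active = inactive

block = DualCopyBlock([0.0, 0.0])
block.update([1.0, 2.0])
snapshot = block.read()
```

Readers only ever see a block that a writer has finished writing, which is the property the atomic coordinate update assumption asks for.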


1.3 Straightforward generalization 


Our async-parallel coordinate update scheme (3) can be generalized to (overlapping) block coordinate updates
after a change to the step size. Specifically, the scheme (3) can be generalized to

$$x^{k+1} = x^k - \frac{\eta_k}{n p_{i_k}} (U_{i_k} \circ S)\hat{x}^k, \qquad (6)$$

where $U_{i_k}$ is randomly drawn from a set of operators $\{U_1, \dots, U_n\}$ ($n \le m$), $U_i : \mathcal{H} \to \mathcal{H}$, following the
probability $P(i_k = i) = p_i$, $i = 1, \dots, n$ ($p_i > 0$ and $\sum_{i=1}^n p_i = 1$). The operators must satisfy $\sum_{i=1}^n U_i = I_{\mathcal{H}}$
and $\sum_{i=1}^n \|U_i x\|^2 \le C\|x\|^2$ for some $C > 0$.

Let $U_i : x \mapsto (0, \dots, 0, x_i, 0, \dots, 0)$, $i = 1, \dots, m$, which has $C = 1$; then (6) reduces to (3). If $\mathcal{H}$ is endowed
with a metric $M$ such that $\rho_1\|x\|^2 \le \|x\|_M^2 \le \rho_2\|x\|^2$ (e.g., the metric in the Condat-Vu primal-dual splitting
[16,55]), then we have

$$\sum_{i=1}^m \|U_i x\|_M^2 \le \rho_2 \sum_{i=1}^m \|U_i x\|^2 = \rho_2 \|x\|^2 \le \frac{\rho_2}{\rho_1} \|x\|_M^2.$$

In general, multiple coordinates can be updated in (6). Consider linear $U_i : x \mapsto (a_{i1} x_1, \dots, a_{im} x_m)$, $i = 1, \dots, n$,
where $\sum_{i=1}^n a_{ij} = 1$ for each $j$. Then, for $C := \max\{\sum_{i=1}^n a_{i1}^2, \dots, \sum_{i=1}^n a_{im}^2\}$, we have

$$\sum_{i=1}^n \|U_i x\|^2 = \sum_{i=1}^n \sum_{j=1}^m a_{ij}^2 \|x_j\|^2 = \sum_{j=1}^m \sum_{i=1}^n a_{ij}^2 \|x_j\|^2 \le C \|x\|^2.$$


1.4 Special cases 

If there is only one agent ($p = 1$), ARock (Algorithm 1) reduces to randomized coordinate update, which
includes the special case of randomized coordinate descent [41] for convex optimization. Sync-parallel coordinate
update is another special case of ARock corresponding to $\hat{x}^k \equiv x^k$. In both cases, there is no delay,
i.e., $\tau = 0$ and $J(k) = \emptyset$. In addition, the step size $\eta_k$ can be more relaxed. In particular, if $p_i = \frac{1}{m}$, $\forall i$, then
we can let $\eta_k = \eta$, $\forall k$, for any $\eta < 1$, or $\eta < 1/\alpha$ when $T$ is $\alpha$-averaged (see Definition 2 for the definition of
an $\alpha$-averaged operator).
1.5 Related work

Chazan and Miranker [14] proposed the first async-parallel method in 1969. The method was designed for
solving linear systems. Later, async-parallel methods have been successfully applied in many fields, e.g., linear
systems [3,10,24,51], nonlinear problems [4,5], differential equations [1,2,13,20], consensus problems [34,
23], and optimization [30,37,38,52,59]. We review the theory for async-parallel fixed-point iteration and its
applications.

General fixed point problems. Totally async-parallel$^1$ iterative methods for a fixed-point problem date
back to Baudet [5], where the operator was assumed to be a P-contraction.$^2$ Later, Bertsekas [7]
generalized the P-contraction assumption and showed convergence. Frommer and Szyld [25] reviewed the
theory and applications of totally async-parallel iterations prior to 2000. This review summarized convergence
results under the conditions in [7]. However, ARock can be applied to solve many more problems since our
nonexpansive assumption, though not strictly weaker than P-contraction, is more pervasive. As opposed to
totally asynchronous methods, Tseng, Bertsekas, and Tsitsiklis [8,54] assumed quasi-nonexpansiveness$^3$ and
proposed an async-parallel method, converging under an additional assumption, which is difficult to justify in
general but can be established for problems such as linear systems and strictly convex network flow problems
[8,54].

The above works assign coordinates in a deterministic manner. In contrast, ARock is stochastic,
works for nonexpansive operators, and applies to a broader class of problems.

Linear, nonlinear, and differential equations. The first async-parallel method for solving linear
equations was introduced by Chazan and Miranker in [14]. They proved that, for solving linear systems,
P-contraction was necessary and sufficient for convergence. The performance of the algorithm was studied by
Iain et al. [10,51] on different High Performance Computing (HPC) architectures. Recently, Avron et al. [3]
revisited the async-parallel coordinate update and showed its linear convergence for solving positive-definite
linear systems. Tarazi and Nabih [22] extended the pioneering work [14] to solving nonlinear equations,
and the async-parallel methods have also been applied for solving differential equations, e.g., in [1,2,13,20].
Except for [3], all these methods are totally async-parallel with the P-contraction condition or its variants.
On solving a positive-definite linear system, [3] made assumptions similar to ours, and it obtained a better
linear convergence rate on that special problem.

Optimization. The first async-parallel coordinate update gradient-projection method was due to Bertsekas
and Tsitsiklis [8]. The method solves constrained optimization problems with a smooth objective and
simple constraints. It was shown that the objective gradient sequence converges to zero. Tseng [53] further
analyzed the convergence rate and obtained local linear convergence based on the assumptions of isocost
surface separation and a local Lipschitz error bound. Recently, Liu et al. [38] developed an async-parallel
stochastic coordinate descent algorithm for minimizing convex smooth functions. Later, Liu and Wright [37]
suggested an async-parallel stochastic proximal coordinate descent algorithm for minimizing convex composite
objective functions. They established the convergence of the expected objective-error sequence for
convex functions. Hsieh et al. [30] proposed an async-parallel dual coordinate descent method for solving
$\ell_2$-regularized empirical risk minimization problems. Other async-parallel approaches include asynchronous
ADMM [28,56,59,31]. Among them, [56,31] use an asynchronous clock, and [28,59] use a central node to
update the dual variable; they do not deal with delay or inconsistency. Async-parallel stochastic gradient
descent methods have also been considered in [39,48].

$^1$ “Totally asynchronous” means no upper bound on the delays; however, other conditions are required, for example: each
coordinate must be updated infinitely many times. By default, “asynchronous” in this paper assumes a finite maximum delay.

$^2$ An operator $T : \mathbb{R}^n \to \mathbb{R}^n$ is a P-contraction if $|T(x) - T(y)| \le P|x - y|$, component-wise, where $|x|$ denotes the vector with
components $|x_i|$, $i = 1, \dots, n$, and $P \in \mathbb{R}^{n \times n}$ is a nonnegative matrix with a spectral radius strictly less than 1.

$^3$ An operator $T : \mathcal{H} \to \mathcal{H}$ is quasi-nonexpansive if $\|Tx - x^*\| \le \|x - x^*\|$, $\forall x \in \mathcal{H}$, $x^* \in \mathrm{Fix}\,T$.

Our framework differs from the recent surge of the aforementioned sync-parallel and async-parallel coordinate
descent algorithms (e.g., [45,33,38,37,30,49]). While they apply to convex function minimization,
ARock covers more cases (such as ADMM, primal-dual, and decentralized methods) and also provides sequence
convergence. In Section 2, we will show that some of the existing async-parallel coordinate descent
algorithms are special cases of ARock, through relating their optimality conditions to nonexpansive operators.
Another difference is that the convergence of ARock only requires a nonexpansive operator with a fixed
point, whereas properties such as strong convexity, bounded feasible set, and bounded sequence, which are
seen in some of the recent literature for async-parallel convex minimization, are unnecessary.

Others. Besides solving equations and optimization problems, there are also applications of async- 
parallel algorithms to optimal control problems [34], network flow problems [21], and consensus problems of 
multi-agent systems [23]. 


1.6 Contributions 

Our contributions and techniques are summarized below: 

— ARock is the first async-parallel coordinate update framework for finding a fixed point to a nonexpansive 
operator. 

— By introducing a new metric and establishing stochastic Fejér monotonicity, we show that, with probability
one, ARock converges to a point in the solution set; linear convergence is obtained for quasi-strongly
monotone operators.

— Based on ARock, we introduce an async-parallel algorithm for linear systems, async-parallel ADMM
algorithms for distributed or decentralized computing problems, as well as async-parallel operator-splitting
algorithms for nonsmooth minimization problems. Some problems are treated in the async-parallel fashion
for the first time. The developed algorithms are not straightforward modifications to their
serial versions because their underlying nonexpansive operators must be identified before applying ARock.


1.7 Notation, definitions, background of monotone operators 

Throughout this paper, $\mathcal{H}$ denotes a separable Hilbert space equipped with the inner product $\langle \cdot, \cdot \rangle$ and norm
$\| \cdot \|$, and $(\Omega, \mathcal{F}, P)$ denotes the underlying probability space, where $\Omega$, $\mathcal{F}$, and $P$ are the sample space,
$\sigma$-algebra, and probability measure, respectively. The map $x : (\Omega, \mathcal{F}) \to (\mathcal{H}, \mathcal{B})$, where $\mathcal{B}$ is the Borel
$\sigma$-algebra, is an $\mathcal{H}$-valued random variable. Let $(x^k)_{k \ge 0}$ denote either a sequence of deterministic points in $\mathcal{H}$
or a sequence of $\mathcal{H}$-valued random variables, which will be clear from the context, and let $x_i \in \mathcal{H}_i$ denote the
$i$th coordinate of $x$. In addition, we let $\mathcal{X}^k := \sigma(x^0, \hat{x}^1, x^1, \dots, \hat{x}^k, x^k)$ denote the smallest $\sigma$-algebra generated
by $x^0, \hat{x}^1, x^1, \dots, \hat{x}^k, x^k$. “Almost surely” is abbreviated as “a.s.”, and the $n$-fold product space of $\mathcal{H}$ is denoted
by $\mathcal{H}^n$. We use $\to$ and $\rightharpoonup$ for strong convergence and weak convergence, respectively.

We define $\mathrm{Fix}\,T := \{x \in \mathcal{H} \mid Tx = x\}$ as the set of fixed points of operator $T$, and, in the product space,
we let $\mathbf{X}^* := \{(x^*, x^*, \dots, x^*) \mid x^* \in \mathrm{Fix}\,T\} \subseteq \mathcal{H}^{\tau+1}$.

Definition 1 An operator $T : \mathcal{H} \to \mathcal{H}$ is $c$-Lipschitz, where $c \ge 0$, if it satisfies $\|Tx - Ty\| \le c\|x - y\|$,
$\forall x, y \in \mathcal{H}$. In particular, $T$ is nonexpansive if $c \le 1$, and contractive if $c < 1$.
Definition 2 Consider an operator $T : \mathcal{H} \to \mathcal{H}$.

— $T$ is $\alpha$-averaged with $\alpha \in (0,1)$, if there is a nonexpansive operator $R : \mathcal{H} \to \mathcal{H}$ such that
$T = (1 - \alpha) I_{\mathcal{H}} + \alpha R$, where $I_{\mathcal{H}} : \mathcal{H} \to \mathcal{H}$ is the identity operator.

— $T$ is $\beta$-cocoercive with $\beta > 0$, if $\langle x - y, Tx - Ty \rangle \ge \beta \|Tx - Ty\|^2$, $\forall x, y \in \mathcal{H}$.

— $T$ is $\mu$-strongly monotone, where $\mu > 0$, if it satisfies $\langle x - y, Tx - Ty \rangle \ge \mu \|x - y\|^2$, $\forall x, y \in \mathcal{H}$. When the
inequality holds for $\mu = 0$, $T$ is monotone.

— $T$ is quasi-$\mu$-strongly monotone, where $\mu > 0$, if it satisfies $\langle x - y, Tx \rangle \ge \mu \|x - y\|^2$, $\forall x \in \mathcal{H}$,
$y \in \mathrm{zer}\,T := \{y \in \mathcal{H} \mid Ty = 0\}$. When the inequality holds for $\mu = 0$, $T$ is quasi-monotone.

— $T$ is demicompact [46] at $x \in \mathcal{H}$ if for every bounded sequence $(x^k)_{k \ge 0}$ in $\mathcal{H}$ such that $Tx^k - x^k \to x$,
there exists a strongly convergent subsequence.

Averaged operators are nonexpansive. By the Cauchy-Schwarz inequality, a $\beta$-cocoercive operator is
$\frac{1}{\beta}$-Lipschitz; the converse is generally untrue, but true for the gradients of convex differentiable functions.
Examples are given in the next section.


2 Applications 

In this section, we provide some applications that are special cases of the fixed-point problem (1). For each 
application, we identify its nonexpansive operator T (or the corresponding operator S) and implement the 
conditions in Theorem 1. For simplicity, we use the uniform distribution, $p_1 = \cdots = p_m = 1/m$, and apply
the simpler update (4) instead of (3).

2.1 Solving linear equations 

Consider the linear system $Ax = b$, where $A \in \mathbb{R}^{m \times m}$ is a nonsingular matrix with nonzero diagonal
entries. Let $A = D + R$, where $D$ and $R$ are the diagonal and off-diagonal parts of $A$, respectively. Let
$M := -D^{-1}R$ and $T(x) := Mx + D^{-1}b$. Then the system $Ax = b$ is equivalent to the fixed-point problem
$x = D^{-1}(b - Rx) =: T(x)$, where $T$ is nonexpansive if the spectral norm $\|M\|_2$ satisfies $\|M\|_2 \le 1$. The
iteration $x^{k+1} = T(x^k)$ is widely known as the Jacobi algorithm. Let $S = I - T$. Each update $S_{i_k}\hat{x}^k$ involves
multiplying just the $i_k$th row of $M$ to $\hat{x}^k$ and adding the $i_k$th entry of $D^{-1}b$, so we arrive at the following
algorithm.

Algorithm 2: ARock for linear equations

Input: $x^0 \in \mathbb{R}^m$, $K > 0$.
set the global iteration counter $k = 0$;
while $k < K$, every agent asynchronously and continuously do
    select $i_k \in \{1, \dots, m\}$ uniformly at random;
    subtract $\frac{\eta_k}{a_{i_k i_k}} \left( \sum_j a_{i_k j} \hat{x}_j^k - b_{i_k} \right)$ from the component $x_{i_k}$ of the variable $x$;
    update the global counter $k \leftarrow k + 1$;


Proposition 1 [6, Example 22.5] Suppose that $T$ is $c$-Lipschitz continuous with $c \in [0,1)$. Then, $I - T$ is
$(1 - c)$-strongly monotone.

Suppose $\|M\|_2 < 1$. Since $T$ is $\|M\|_2$-Lipschitz continuous, by Proposition 1, $S$ is $(1 - \|M\|_2)$-strongly
monotone. By Theorem 4, Algorithm 2 converges linearly.
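
A minimal serial sketch of the coordinate update in Algorithm 2 follows. It assumes a single agent ($p = 1$), so every read is fresh and there is no delay; the function name `arock_linear`, the test matrix, and the step size 0.8 are hypothetical choices for illustration only.

```python
import random

def arock_linear(A, b, eta=0.8, num_iters=5000, seed=0):
    """Randomized coordinate update for A x = b with a single agent (no delay).

    Each step subtracts (eta / a_ii) * (sum_j a_ij x_j - b_i) from x_i,
    which is eta times the i-th entry of S x = x - D^{-1}(b - R x).
    """
    rng = random.Random(seed)
    m = len(b)
    x = [0.0] * m
    for _ in range(num_iters):
        i = rng.randrange(m)                       # uniform coordinate selection
        row_residual = sum(A[i][j] * x[j] for j in range(m)) - b[i]
        x[i] -= eta * row_residual / A[i][i]
    return x

# Strictly diagonally dominant example; b is chosen so the solution is [1, 1, 1].
A = [[4.0, 1.0, 0.0], [1.0, 5.0, 2.0], [0.0, 2.0, 6.0]]
b = [5.0, 8.0, 8.0]
x_sol = arock_linear(A, b)
```

Only row $i_k$ of $A$ and entry $i_k$ of $b$ are touched per step, which is the structure that makes the coordinate update much cheaper than a full Jacobi sweep.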










Z. Peng, Y. Xu, M. Yan, and W. Yin 


2.2 Minimize convex smooth function 


Consider the optimization problem

$$\underset{x \in \mathcal{H}}{\text{minimize}} \; f(x), \qquad (7)$$

where $f$ is a closed proper convex differentiable function and $\nabla f$ is $L$-Lipschitz continuous, $L > 0$. Let
$S := \frac{2}{L} \nabla f$. As $f$ is convex and differentiable, $x$ is a minimizer of $f$ if and only if $x$ is a zero of $S$. Note
that $S$ is $\frac{1}{2}$-cocoercive. By Lemma 1, $T = I - S$ is nonexpansive. Applying ARock, we have the following
iteration:

$$x^{k+1} = x^k - \eta_k S_{i_k} \hat{x}^k, \qquad (8)$$

where $S_{i_k} x = \frac{2}{L} (0, \dots, 0, \nabla_{i_k} f(x), 0, \dots, 0)^T$. Note that $\nabla f$ needs a structure that makes it cheap to compute
$\nabla_{i_k} f(\hat{x}^k)$. Let us give two such examples: (i) quadratic programming: $f(x) = \frac{1}{2} x^T A x - b^T x$, where $\nabla f(x) =
Ax - b$ and $\nabla_{i_k} f(\hat{x}^k)$ only depends on a part of $A$ and $b$; (ii) sum of sparsely supported functions: $f = \sum_j f_j$
and $\nabla f = \sum_j \nabla f_j$, where each $f_j$ depends on just a few variables.
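
For the quadratic programming example (i), iteration (8) can be sketched as below. The sketch assumes a single agent with fresh reads, a symmetric positive-definite test matrix, and a Gershgorin row-sum bound used as the Lipschitz constant $L$; the names `arock_quadratic`, `A_qp`, `b_qp`, and `L_qp` are hypothetical.

```python
import random

def arock_quadratic(A, b, L, eta=0.9, num_iters=5000, seed=0):
    """Single-agent update (8) for f(x) = 0.5 x^T A x - b^T x with S = (2/L) grad f."""
    rng = random.Random(seed)
    m = len(b)
    x = [0.0] * m
    for _ in range(num_iters):
        i = rng.randrange(m)
        # Coordinate gradient: only row i of A and entry i of b are touched.
        grad_i = sum(A[i][j] * x[j] for j in range(m)) - b[i]
        x[i] -= eta * (2.0 / L) * grad_i
    return x

A_qp = [[4.0, 1.0, 0.0], [1.0, 5.0, 2.0], [0.0, 2.0, 6.0]]
b_qp = [5.0, 8.0, 8.0]   # the minimizer solves A x = b, here [1, 1, 1]
# Gershgorin bound on the largest eigenvalue, used as a Lipschitz constant of grad f.
L_qp = max(sum(abs(v) for v in row) for row in A_qp)
x_qp = arock_quadratic(A_qp, b_qp, L_qp)
```

Any upper bound on the largest eigenvalue of $A$ is a valid Lipschitz constant for $\nabla f$, so the cheap row-sum bound avoids an eigenvalue computation at the cost of a somewhat smaller step.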

Theorem 3 below guarantees the convergence of $(x^k)_{k \ge 0}$ if $\eta_k \in [\eta_{\min}, \frac{1}{2\tau/\sqrt{m} + 1})$. In addition, if $f(x)$ is
restricted strongly convex, namely, for any $x \in \mathcal{H}$ and $x^* \in X^*$, where $X^*$ is the solution set to (7), we have
$\langle x - x^*, \nabla f(x) \rangle \ge \mu \|x - x^*\|^2$ for some $\mu > 0$, then $S$ is quasi-strongly monotone with modulus $\frac{2\mu}{L}$. According
to Theorem 4, iteration (8) converges at a linear rate if the step size meets the condition therein.

Our convergence and rates are given in terms of the distance to the solution set $X^*$. In comparison, the
results in the work [38] are given in terms of objective error under the assumption of a uniformly bounded
$(x^k)_{k \ge 0}$. In addition, their step size decays like $O(\frac{1}{\tau \rho^{\tau}})$ for some $\rho > 1$ depending on $\tau$, and our $O(\frac{1}{\tau})$ is
better. Under similar assumptions, Bertsekas and Tsitsiklis [8, Section 7.5] also describe an algorithm for (7)
and prove only subsequence convergence [8, Proposition 5.3] in $\mathbb{R}^n$.


2.3 Decentralized consensus optimization 

Consider that $m$ agents in a connected network solve the consensus problem of minimizing $\sum_{i=1}^m f_i(x)$, where
$x \in \mathbb{R}^d$ is the shared variable and the convex differentiable function $f_i$ is held privately by agent $i$. We assume
that $\nabla f_i$ is $L_i$-Lipschitz continuous for all $i$. A decentralized gradient descent algorithm [40] can be developed
based on the equivalent formulation

$$\underset{x_1, \dots, x_m \in \mathbb{R}^d}{\text{minimize}} \; f(\mathbf{x}) := \sum_{i=1}^m f_i(x_i), \quad \text{subject to } W\mathbf{x} = \mathbf{x}, \qquad (9)$$

where $\mathbf{x} = (x_1, \dots, x_m)^T \in \mathbb{R}^{m \times d}$ and $W \in \mathbb{R}^{m \times m}$ is the so-called mixing matrix satisfying: $W\mathbf{x} = \mathbf{x}$ if
and only if $x_1 = \cdots = x_m$. For $i \ne j$, if $w_{ij} \ne 0$, then agent $i$ can communicate with agent $j$; otherwise
they cannot. We assume that $W$ is symmetric and doubly stochastic. Then, the decentralized consensus
algorithm [40] can be expressed as $\mathbf{x}^{k+1} = W\mathbf{x}^k - \gamma \nabla f(\mathbf{x}^k) = \mathbf{x}^k - \gamma (\nabla f(\mathbf{x}^k) + \frac{1}{\gamma}(I - W)\mathbf{x}^k)$, where
$\nabla f(\mathbf{x}) \in \mathbb{R}^{m \times d}$ is a matrix with its $i$th row equal to $(\nabla f_i(x_i))^T$; see [58]. The computation of $W\mathbf{x}^k$ involves
communication between agents, and $\nabla f_i(x_i)$ is independently computed by each agent $i$. The iteration is
equivalent to the gradient descent iteration applied to $\min_{\mathbf{x}} F(\mathbf{x}) := \sum_{i=1}^m f_i(x_i) + \frac{1}{2\gamma} \mathbf{x}^T (I - W)\mathbf{x}$. To apply our
algorithm, we let $S := \frac{2}{L} \nabla F = \frac{2}{L}(\nabla f + \frac{1}{\gamma}(I - W))$ with $L = \max_i L_i + (1 - \lambda_{\min}(W))/\gamma$, where $\lambda_{\min}(W)$ is
the smallest eigenvalue of $W$. Computing $S_i \hat{\mathbf{x}}^k$ reduces to computing $\nabla f_i(\hat{x}_i^k)$ and the $i$th entry of $W\hat{\mathbf{x}}^k$ or
$\sum_j w_{i,j} \hat{x}_j^k$, which involves only $\hat{x}_i^k$ and $\hat{x}_j^k$ from the neighbors of agent $i$. Note that since each agent $i$ can
store its own $x_i$ locally, we have $\hat{x}_i^k = x_i^k$.

If the agents are $p$ independent Poisson processes and each agent $i$ has activation rate $\lambda_i$, then
the probability that agent $i$ activates before other agents is equal to $\frac{\lambda_i}{\sum_{j=1}^{p} \lambda_j}$ [35], and therefore our random
sample scheme holds and ARock applies naturally. The algorithm is summarized as follows:

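Algorithm 3: ARock for decentralized optimization (9)

Input: Each agent $i$ sets $x_i^0 \in \mathbb{R}^d$, $K > 0$.
while $k < K$ do
    when an agent $i$ is activated, $x_i^{k+1} = x_i^k - \frac{2\eta_k}{L} \left( \nabla f_i(x_i^k) + \frac{1}{\gamma} \left( x_i^k - \sum_j w_{i,j} \hat{x}_j^k \right) \right)$;
    increase the global counter $k \leftarrow k + 1$;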
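
The agent update in Algorithm 3 can be sketched serially as below. Everything specific here is an assumption for illustration: scalar variables ($d = 1$), local objectives $f_i(x) = \frac{1}{2}(x - a_i)^2$ with $L_i = 1$, a particular mixing matrix `W` with smallest eigenvalue 0.25, uniform random activation standing in for the Poisson clocks, and neighbor values read without delay.

```python
import random

def grad_F_row(x, i, a, W, gamma):
    """Row i of grad F: grad f_i(x_i) + (1/gamma) * (x_i - sum_j w_ij x_j)."""
    mixed = sum(W[i][j] * x[j] for j in range(len(x)))
    return (x[i] - a[i]) + (x[i] - mixed) / gamma

def decentralized_arock(a, W, gamma, L, eta=0.9, num_iters=5000, seed=0):
    """Serial sketch: one randomly activated agent updates its own x_i per step."""
    rng = random.Random(seed)
    x = [0.0] * len(a)
    for _ in range(num_iters):
        i = rng.randrange(len(a))          # stands in for a Poisson activation
        x[i] -= (2.0 * eta / L) * grad_F_row(x, i, a, W, gamma)
    return x

a = [1.0, 2.0, 6.0]
W = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]]
gamma = 0.1
L = 1.0 + (1.0 - 0.25) / gamma     # max_i L_i = 1 and lambda_min(W) = 0.25
x_dec = decentralized_arock(a, W, gamma, L)
stationarity = max(abs(grad_F_row(x_dec, i, a, W, gamma)) for i in range(3))
```

The limit is a minimizer of the penalized objective $F$ rather than an exact consensus point. Because $W$ is doubly stochastic, the rows of $\nabla F$ sum to $\sum_i (x_i - a_i)$, so the average of the agents' values matches the average of the $a_i$ at stationarity.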


2.4 Minimize smooth + nonsmooth functions 


Consider the problem

$$\underset{x \in \mathcal{H}}{\text{minimize}} \; f(x) + g(x), \qquad (10)$$


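where $f$ is closed proper convex and $g$ is convex and $L$-Lipschitz differentiable with $L > 0$. Problems in the
form of (10) arise in statistical regression, machine learning, and signal processing and include well-known
problems such as the support vector machine, regularized least-squares, and regularized logistic regression.
For any $x \in \mathcal{H}$ and scalar $\gamma \in (0, \frac{2}{L})$, define the proximal operator $\mathrm{prox}_{\gamma f} : \mathcal{H} \to \mathcal{H}$ and the reflective-proximal
operator $\mathrm{refl}_{\gamma f} : \mathcal{H} \to \mathcal{H}$ as

$$\mathrm{prox}_{\gamma f}(x) := \underset{y \in \mathcal{H}}{\arg\min} \; f(y) + \frac{1}{2\gamma} \|y - x\|^2 \quad \text{and} \quad \mathrm{refl}_{\gamma f} := 2\,\mathrm{prox}_{\gamma f} - I_{\mathcal{H}}, \qquad (11)$$

respectively, and define the following forward-backward operator $T_{\mathrm{FBS}} := \mathrm{prox}_{\gamma f} \circ (I - \gamma \nabla g)$. Because
$\mathrm{prox}_{\gamma f}$ is $\frac{1}{2}$-averaged and $(I - \gamma \nabla g)$ is $\frac{\gamma L}{2}$-averaged, $T_{\mathrm{FBS}}$ is $\alpha$-averaged for some $\alpha \in [\frac{2}{3}, 1)$ [6, Propositions 4.32
and 4.33]. Define $S := I - T_{\mathrm{FBS}} = I - \mathrm{prox}_{\gamma f} \circ (I - \gamma \nabla g)$. When we apply Algorithm 1 to $T = T_{\mathrm{FBS}}$ to
solve (10), and assume $f$ is separable in all coordinates, that is, $f(x) = \sum_{i=1}^m f_i(x_i)$, the update for the $i_k$th
selected coordinate is

$$x_{i_k}^{k+1} = x_{i_k}^k - \eta_k \left( \hat{x}_{i_k}^k - \mathrm{prox}_{\gamma f_{i_k}} \big( \hat{x}_{i_k}^k - \gamma \nabla_{i_k} g(\hat{x}^k) \big) \right). \qquad (12)$$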

Examples of separable functions include the $\ell_1$ norm, the squared $\ell_2$ norm, the Huber function, and the indicator
function of box constraints, i.e., $\{x \mid a_i \le x_i \le b_i, \forall i\}$. They all have simple prox maps. If $\eta_k \in \big[\eta_{\min}, \frac{m p_{\min}}{2\tau\sqrt{p_{\min}}+1}\big)$,
then the convergence is guaranteed by Theorem 3. To show linear convergence, we need to assume that
$g(x)$ is strongly convex. Then, Proposition 2 below shows that $\mathrm{prox}_{\gamma f} \circ (I - \gamma\nabla g)$ is a quasi-contractive
operator, and by Proposition 1, operator $I - \mathrm{prox}_{\gamma f} \circ (I - \gamma\nabla g)$ is quasi-strongly monotone. Finally, linear
convergence and its rate follow from Theorem 4.
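
As an illustration of update (12), the following minimal sketch takes $f = \lambda\|\cdot\|_1$, whose prox map is soft-thresholding, and performs one coordinate step without delay (so the stale copy $\hat{x}^k$ equals $x^k$). The function names `soft_threshold` and `fbs_coordinate_step` and the argument layout are assumptions made for this example only; they are not part of ARock's code.

```python
def soft_threshold(v, t):
    """Prox of t*|.| at v: shrink v toward zero by t (t >= 0)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fbs_coordinate_step(x, i, grad_i, gamma, lam, eta):
    """One step of the form (12) for f = lam*||.||_1 on coordinate i.

    x      : list of floats, modified in place
    grad_i : i-th partial derivative of g at the (here non-stale) point x
    gamma  : prox/gradient parameter, assumed to lie in (0, 2/L)
    eta    : step size eta_k
    """
    forward = x[i] - gamma * grad_i                    # forward (gradient) step
    backward = soft_threshold(forward, gamma * lam)    # backward (prox) step
    x[i] = x[i] - eta * (x[i] - backward)              # relaxed coordinate update
    return x
```

With `eta = 1` the coordinate is replaced by its forward-backward image; smaller `eta` damps the move, which is what allows stale reads in the asynchronous setting.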

Proposition 2 Assume that $f$ is a closed proper convex function, and $g$ is $L$-Lipschitz differentiable and
strongly convex with modulus $\mu > 0$. Let $\gamma \in (0, \frac{2}{L})$. Then, both $I - \gamma\nabla g$ and $\mathrm{prox}_{\gamma f} \circ (I - \gamma\nabla g)$ are
quasi-contractive operators.

Proof We first show that $I - \gamma\nabla g$ is a quasi-contractive operator. Note

$$\begin{aligned}
\|(x - \gamma\nabla g(x)) - (x^* - \gamma\nabla g(x^*))\|^2
&= \|x - x^*\|^2 - 2\gamma\langle x - x^*, \nabla g(x) - \nabla g(x^*)\rangle + \gamma^2\|\nabla g(x) - \nabla g(x^*)\|^2 \\
&\le \|x - x^*\|^2 - \gamma(2 - \gamma L)\langle x - x^*, \nabla g(x) - \nabla g(x^*)\rangle \\
&\le (1 - 2\gamma\mu + \mu\gamma^2 L)\|x - x^*\|^2,
\end{aligned}$$

where the first inequality follows from the Baillon-Haddad theorem (footnote 4) and the second one from the strong
convexity of $g$. Hence, $I - \gamma\nabla g$ is quasi-contractive if $0 < \gamma < 2/L$. Since $f$ is convex, $\mathrm{prox}_{\gamma f}$ is firmly
nonexpansive, and thus we immediately have the quasi-contractiveness of $\mathrm{prox}_{\gamma f} \circ (I - \gamma\nabla g)$ from that of
$I - \gamma\nabla g$.


2.5 Minimize nonsmooth + nonsmooth functions 


Consider

$$\underset{x\in\mathcal{H}}{\text{minimize}}\;\; f(x) + g(x), \tag{13}$$

where both $f(x)$ and $g(x)$ are closed proper convex and their prox maps are easy to compute. Define the
Peaceman-Rachford [36] operator:

$$T_{\mathrm{PRS}} := \mathrm{refl}_{\gamma f} \circ \mathrm{refl}_{\gamma g}.$$

Since both $\mathrm{refl}_{\gamma f}$ and $\mathrm{refl}_{\gamma g}$ are nonexpansive, their composition $T_{\mathrm{PRS}}$ is also nonexpansive. Let $S :=
I - T_{\mathrm{PRS}}$. When applying ARock to $T = T_{\mathrm{PRS}}$ to solve problem (13), the update (6) reduces to:

$$z^{k+1} = z^k - \eta_k U_{i_k} \circ (I - \mathrm{refl}_{\gamma f} \circ \mathrm{refl}_{\gamma g})\hat{z}^k, \tag{14}$$

where we use $z$ instead of $x$ since the limit $z^*$ of $(z^k)_{k\ge 0}$ is not a solution to (13); instead, a solution must be
recovered via $x^* = \mathrm{prox}_{\gamma g} z^*$. The convergence follows from Theorem 3 and that $T_{\mathrm{PRS}}$ is nonexpansive. If
either $f$ or $g$ is strongly convex, then $T_{\mathrm{PRS}}$ is contractive and thus by Theorem 4, ARock converges linearly.
Finer convergence rates follow from [17,18]. A naive implementation of (14) is


$$\hat{x}^k = \mathrm{prox}_{\gamma g}(\hat{z}^k), \tag{15a}$$
$$\hat{y}^k = \mathrm{prox}_{\gamma f}(2\hat{x}^k - \hat{z}^k), \tag{15b}$$
$$z^{k+1} = z^k + 2\eta_k U_{i_k}(\hat{y}^k - \hat{x}^k), \tag{15c}$$

where $\hat{x}^k$ and $\hat{y}^k$ are intermediate variables. Note that the order in which the proximal operators are applied
to $f$ and $g$ affects both $z^k$ [57] and whether coordinate-wise updates can be efficiently computed. Next, we
present two special cases of (13) in Subsections 2.5.1 and 2.6 and discuss how to efficiently implement the
update (15).


4 Let $g$ be a convex differentiable function. Then, $\nabla g$ is $L$-Lipschitz if and only if it is $\frac{1}{L}$-cocoercive.
2.5.1 Feasibility problem


Suppose that $C_1, \ldots, C_m$ are closed convex subsets of $\mathcal{H}$ with a nonempty intersection. The problem is to find
a point in the intersection. Let $I_{C_i}$ be the indicator function of the set $C_i$, that is, $I_{C_i}(x) = 0$ if $x \in C_i$ and
$\infty$ otherwise. The feasibility problem can be formulated as the following

$$\underset{x=(x_1,\ldots,x_m)\in\mathcal{H}^m}{\text{minimize}}\;\; \sum_{i=1}^m I_{C_i}(x_i) + I_{\{x_1=\cdots=x_m\}}(x).$$

Let $z^k = (z_1^k, \ldots, z_m^k) \in \mathcal{H}^m$, $\hat{z}^k = (\hat{z}_1^k, \ldots, \hat{z}_m^k) \in \mathcal{H}^m$, and $\hat{\bar{z}}^k \in \mathcal{H}$. We can implement (15) as follows (see
Appendix A for the step-by-step derivation):

$$\hat{\bar{z}}^k = \frac{1}{m}\sum_{i=1}^m \hat{z}_i^k, \tag{16a}$$
$$\hat{y}_{i_k}^k = \mathrm{Proj}_{C_{i_k}}\big(2\hat{\bar{z}}^k - \hat{z}_{i_k}^k\big), \tag{16b}$$
$$z_{i_k}^{k+1} = z_{i_k}^k + 2\eta_k\big(\hat{y}_{i_k}^k - \hat{\bar{z}}^k\big). \tag{16c}$$

The update (16) can be implemented as follows. Let global memory hold $z_1, \ldots, z_m$, as well as $\bar{z} = \frac{1}{m}\sum_{i=1}^m z_i$.
At the $k$th update, an agent independently generates a random number $i_k \in \{1, \ldots, m\}$, then reads
$z_{i_k}$ as $\hat{z}_{i_k}^k$ and $\bar{z}$ as $\hat{\bar{z}}^k$, and finally computes $\hat{y}_{i_k}$ and updates $z_{i_k}$ in global memory according to (16). Since
$\bar{z}$ is maintained in global memory, the agent updates $\bar{z}$ according to $\bar{z}^{k+1} = \bar{z}^k + \frac{1}{m}(z_{i_k}^{k+1} - z_{i_k}^k)$. This
implementation saves each agent from computing (16a) or reading all $z_1, \ldots, z_m$. Each agent only reads $z_{i_k}$
and $\bar{z}$, executes (16b), and updates $z_{i_k}$ (16c) and $\bar{z}$.
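
The bookkeeping described above can be sketched in a few lines. The example below is a single-threaded simulation (no delays, no atomic writes) for intervals on the real line, which is an assumption chosen only to make the projection trivial; the names `project_interval`, `feasibility_step`, and `solve` are hypothetical. It shows how an agent reads one block and the running mean, applies (16b) and (16c), and patches the mean without touching the other blocks.

```python
import random


def project_interval(v, lo, hi):
    """Projection of a scalar v onto the closed interval [lo, hi]."""
    return min(max(v, lo), hi)


def feasibility_step(z, zbar, i, sets, eta):
    """Apply (16b)-(16c) to block i and return the refreshed running mean.

    z    : list of scalars z_1..z_m (modified in place)
    zbar : current mean of z, maintained incrementally
    sets : list of (lo, hi) pairs describing C_1..C_m
    """
    lo, hi = sets[i]
    y_i = project_interval(2.0 * zbar - z[i], lo, hi)   # (16b)
    z_new = z[i] + 2.0 * eta * (y_i - zbar)             # (16c)
    zbar = zbar + (z_new - z[i]) / len(z)               # keep zbar = mean(z)
    z[i] = z_new
    return zbar


def solve(sets, eta=0.45, iters=2000, seed=0):
    """Run the sequential simulation with uniformly random block choices."""
    rng = random.Random(seed)
    z = [0.0] * len(sets)
    zbar = 0.0
    for _ in range(iters):
        zbar = feasibility_step(z, zbar, rng.randrange(len(sets)), sets, eta)
    return zbar
```

Each step costs one projection and a constant amount of arithmetic, independent of $m$, which is the point of storing $\bar{z}$ in global memory.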


2.6 Async-parallel ADMM 


This is another application of (15). Consider

$$\underset{x\in\mathcal{H}_1,\; y\in\mathcal{H}_2}{\text{minimize}}\;\; f(x) + g(y) \quad\text{subject to}\quad Ax + By = b, \tag{17}$$

where $\mathcal{H}_1$ and $\mathcal{H}_2$ are Hilbert spaces, and $A$ and $B$ are bounded linear operators. We apply the update (15) to
the Lagrange dual of (17) (see [26] for the derivation):

$$\underset{w}{\text{minimize}}\;\; d_f(w) + d_g(w), \tag{18}$$

where $d_f(w) := f^*(A^*w)$, $d_g(w) := g^*(B^*w) - \langle w, b\rangle$, and $f^*$ and $g^*$ denote the convex conjugates of $f$
and $g$, respectively. The proximal maps induced by $d_f$ and $d_g$ can be computed via solving subproblems
that involve only the original terms in (17): $z^+ = \mathrm{prox}_{\gamma d_f}(z)$ can be computed by (see Appendix A for the
derivation)

$$\begin{cases} x^+ \in \arg\min_x\; f(x) - \langle z, Ax\rangle + \frac{\gamma}{2}\|Ax\|^2, \\ z^+ = z - \gamma Ax^+, \end{cases} \tag{19}$$

and $z^+ = \mathrm{prox}_{\gamma d_g}(z)$ by

$$\begin{cases} y^+ \in \arg\min_y\; g(y) - \langle z, By - b\rangle + \frac{\gamma}{2}\|By - b\|^2, \\ z^+ = z - \gamma(By^+ - b). \end{cases} \tag{20}$$
Plugging (19) and (20) into (15) yields the following naive implementation

$$\hat{y}^k \in \arg\min_y\; g(y) - \langle \hat{z}^k, By - b\rangle + \frac{\gamma}{2}\|By - b\|^2, \tag{21a}$$
$$\hat{w}_{d_g}^k = \hat{z}^k - \gamma(B\hat{y}^k - b), \tag{21b}$$
$$\hat{x}^k \in \arg\min_x\; f(x) - \langle 2\hat{w}_{d_g}^k - \hat{z}^k, Ax\rangle + \frac{\gamma}{2}\|Ax\|^2, \tag{21c}$$
$$\hat{w}_{d_f}^k = 2\hat{w}_{d_g}^k - \hat{z}^k - \gamma A\hat{x}^k, \tag{21d}$$
$$z_{i_k}^{k+1} = z_{i_k}^k + \eta_k\big(\hat{w}_{d_f,i_k}^k - \hat{w}_{d_g,i_k}^k\big). \tag{21e}$$

Note that $2\eta_k$ in (15c) becomes $\eta_k$ in (21e) because ADMM is equivalent to the Douglas-Rachford operator,
which is the average of the Peaceman-Rachford operator and the identity operator [36]. Under favorable
structures, (21) can be implemented efficiently. For instance, when $A$ and $B$ are block diagonal matrices
and $f$, $g$ are corresponding block separable functions, steps (21a)-(21d) reduce to independent computation
for each $i$. Since only $\hat{w}_{d_f,i_k}^k$ and $\hat{w}_{d_g,i_k}^k$ are needed to update the main variable $z^k$, we only need to compute
(21a)-(21d) for the $i_k$th block. This is exploited in distributed and decentralized ADMM in the next
two subsections.


2.6.1 Async-parallel ADMM for consensus optimization 
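Consider the consensus optimization problem:

$$\underset{x_i,\, y\in\mathcal{H}}{\text{minimize}}\;\; \sum_{i=1}^m f_i(x_i) \quad\text{subject to}\quad x_i - y = 0,\; \forall i = 1, \ldots, m, \tag{22}$$

where $f_i(x_i)$ are proper closed convex functions. Rewrite (22) to the ADMM form:

$$\underset{x_i,\, y\in\mathcal{H}}{\text{minimize}}\;\; \sum_{i=1}^m f_i(x_i) + g(y) \quad\text{subject to}\quad
\begin{bmatrix} I_{\mathcal{H}} & 0 & \cdots & 0 \\ 0 & I_{\mathcal{H}} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & I_{\mathcal{H}} \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix}
- \begin{bmatrix} I_{\mathcal{H}} \\ I_{\mathcal{H}} \\ \vdots \\ I_{\mathcal{H}} \end{bmatrix} y = 0, \tag{23}$$

where $g = 0$. Now apply the async-parallel ADMM (21) to (23) with dual variables $z_1, \ldots, z_m \in \mathcal{H}$. In
particular, the updates (21a), (21b), (21c), (21d) reduce to

$$\begin{aligned}
\hat{y}^k &= \arg\min_y \Big\{ \sum_{i=1}^m \langle \hat{z}_i^k, y\rangle + \frac{\gamma m}{2}\|y\|^2 \Big\}, \\
(\hat{w}_{d_g}^k)_i &= \hat{z}_i^k + \gamma\hat{y}^k, \\
\hat{x}_i^k &= \arg\min_{x_i} \Big\{ f_i(x_i) - \langle 2(\hat{w}_{d_g}^k)_i - \hat{z}_i^k, x_i\rangle + \frac{\gamma}{2}\|x_i\|^2 \Big\}, \\
(\hat{w}_{d_f}^k)_i &= 2(\hat{w}_{d_g}^k)_i - \hat{z}_i^k - \gamma\hat{x}_i^k.
\end{aligned} \tag{24}$$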











Therefore, we obtain the following async-parallel ADMM algorithm for the problem (22). This algorithm 
applies to all the distributed applications in [11]. 

Algorithm 4: ARock for consensus optimization
Input : set shared variables $y^0$, $z_i^0$, $\forall i$, and $K > 0$.
while $k < K$ every agent asynchronously and continuously do
    choose $i_k$ from $\{1, \ldots, m\}$ with equal probability;
    evaluate $(\hat{w}_{d_g}^k)_{i_k}$, $\hat{x}_{i_k}^k$, and $(\hat{w}_{d_f}^k)_{i_k}$ following (24);
    update $z_{i_k}^{k+1} = z_{i_k}^k + \eta_k\big((\hat{w}_{d_f}^k)_{i_k} - (\hat{w}_{d_g}^k)_{i_k}\big)$;
    update $y^{k+1} = y^k + \frac{1}{\gamma m}\big(z_{i_k}^k - z_{i_k}^{k+1}\big)$;
    update the global counter $k \leftarrow k + 1$;


2.6.2 Async-parallel ADMM for decentralized optimization 

Let $V = \{1, \ldots, m\}$ be a set of agents and $E = \{(i,j) \mid \text{agent } i \text{ connects to agent } j,\; i < j\}$ be the set of
undirected links between the agents. Consider the following decentralized consensus optimization problem
on the graph $G = (V, E)$:

$$\underset{x_1,\ldots,x_m\in\mathbb{R}^d}{\text{minimize}}\;\; f(x_1, \ldots, x_m) := \sum_{i=1}^m f_i(x_i), \quad\text{subject to}\quad x_i = x_j,\; \forall (i,j) \in E, \tag{25}$$

where $x_1, \ldots, x_m \in \mathbb{R}^d$ are the local variables and each agent can only communicate with its neighbors in
$G$. By introducing the auxiliary variable $y_{ij}$ associated with each edge $(i,j) \in E$, the problem (25) can be
reformulated as:

$$\underset{x_i,\, y_{ij}}{\text{minimize}}\;\; \sum_{i=1}^m f_i(x_i), \quad\text{subject to}\quad x_i = y_{ij},\; x_j = y_{ij},\; \forall (i,j) \in E. \tag{26}$$

Define $x = (x_1, \ldots, x_m)^T$ and $y = (y_{ij})_{(i,j)\in E} \in \mathbb{R}^{|E|d}$ to rewrite (26) as

$$\underset{x,\, y}{\text{minimize}}\;\; \sum_{i=1}^m f_i(x_i), \quad\text{subject to}\quad Ax + By = 0, \tag{27}$$

for proper matrices $A$ and $B$. Applying the async-parallel ADMM (21) to (27) gives rise to the following
simplified update: Let $E(i)$ be the set of edges connected with agent $i$ and $|E(i)|$ be its cardinality. Let
$L(i) = \{j \mid (j,i) \in E(i),\; j < i\}$ and $R(i) = \{j \mid (i,j) \in E(i),\; j > i\}$. To every pair of constraints $x_i = y_{ij}$
and $x_j = y_{ij}$, $(i,j) \in E$, we associate the dual variables $z_{ij,i}$ and $z_{ij,j}$, respectively. Whenever some agent $i$
is activated, it calculates

$$\hat{x}_i^k = \arg\min_{x_i}\; f_i(x_i) + \Big\langle \sum_{l\in L(i)} \hat{z}_{li,l}^k + \sum_{r\in R(i)} \hat{z}_{ir,r}^k,\; x_i \Big\rangle + \frac{\gamma}{2}|E(i)|\cdot\|x_i\|^2, \tag{28a}$$
$$z_{li,i}^{k+1} = z_{li,i}^k - \eta_k\big((z_{li,i}^k + \hat{z}_{li,l}^k)/2 + \gamma\hat{x}_i^k\big), \quad \forall l \in L(i), \tag{28b}$$
$$z_{ir,i}^{k+1} = z_{ir,i}^k - \eta_k\big((z_{ir,i}^k + \hat{z}_{ir,r}^k)/2 + \gamma\hat{x}_i^k\big), \quad \forall r \in R(i). \tag{28c}$$
We present the algorithm based on (28) for problem (25) in Algorithm 5.

Algorithm 5: ARock for the decentralized problem (26)

Input : Each agent $i$ sets the dual variables $z_{e,i}^0 = 0$ for $e \in E(i)$, $K > 0$.
while $k < K$, any activated agent $i$ do
    (previously received $\hat{z}_{li,l}^k$ from neighbors $l \in L(i)$ and $\hat{z}_{ir,r}^k$ from $r \in R(i)$);
    update $\hat{x}_i^k$ according to (28a);
    update $z_{li,i}^{k+1}$ and $z_{ir,i}^{k+1}$ according to (28b) and (28c), respectively;
    send $z_{li,i}^{k+1}$ to neighbors $l \in L(i)$ and $z_{ir,i}^{k+1}$ to neighbors $r \in R(i)$;

Algorithm 5 activates one agent at each iteration and updates all the dual variables associated with the 
agent. In this case, only one-sided communication is needed, for sending the updated dual variables in the 
last step. We allow this communication to be delayed in the sense that agent i’s neighbors may be activated 
and start their computation before receiving the latest dual variables from agent i. 

Our algorithm is different from the asynchronous ADMM algorithm by Wei and Ozdaglar [56]. Their
algorithm activates an edge and its two associated agents at each iteration and thus requires two-sided
communication at each activation. We can recover their algorithm as a special case by activating an edge
$(i,j) \in E$ and its associated agents $i$ and $j$ at each iteration, updating the dual variables $z_{ij,i}$ and $z_{ij,j}$
associated with the edge, as well as computing the intermediate variables $x_i$, $x_j$, and $y_{ij}$. The updates
are derived from (27) with the orders of $x$ and $y$ swapped. Note that [56] does not consider the situation
in which adjacent edges are activated within a short period of time, which may cause overlapped computation and
delayed communication. Indeed, their algorithm corresponds to $\tau = 0$ and the corresponding stepsize $\eta_k \equiv 1$.
Appendix B presents the steps to derive the algorithms in this subsection.


3 Convergence 

We establish weak and strong convergence in Subsection 3.1 and linear convergence in Subsection 3.2. Step 
size selection is also discussed. 


3.1 Almost sure convergence 

Assumption 1 Throughout our analysis, we assume $p_{\min} := \min_i p_i > 0$ and

$$\mathrm{Prob}(i_k = i \mid \mathcal{X}^k) = \mathrm{Prob}(i_k = i) = p_i, \quad \forall i, k. \tag{29}$$

We let $|J(k)|$ be the number of elements in $J(k)$ (see Subsection 1.2). Only for the purpose of analysis, we
define the (never computed) full update at the $k$th iteration:

$$\bar{x}^{k+1} := x^k - \eta_k S\hat{x}^k. \tag{30}$$

Lemma 1 below shows that $T$ is nonexpansive if and only if $S$ is 1/2-cocoercive.

Lemma 1 Operator $T : \mathcal{H} \to \mathcal{H}$ is nonexpansive if and only if $S = I - T$ is 1/2-cocoercive, i.e., $\langle x - y, Sx -
Sy\rangle \ge \frac{1}{2}\|Sx - Sy\|^2$, $\forall x, y \in \mathcal{H}$.

Proof See textbook [6, Proposition 4.33] for the proof of the "if" part, and the "only if" part, though missing
there, follows by just reversing the proof.
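
For readers who want the argument spelled out, both directions of Lemma 1 can be read off a single expansion. The following worked identity is a sketch added here for convenience (it is not quoted from [6]) and uses only $T = I - S$:

```latex
\begin{aligned}
\|Tx - Ty\|^2 &= \|(x - y) - (Sx - Sy)\|^2 \\
              &= \|x - y\|^2 - 2\langle x - y,\, Sx - Sy\rangle + \|Sx - Sy\|^2 .
\end{aligned}
```

Hence $\|Tx - Ty\|^2 \le \|x - y\|^2$ holds for all $x, y$ exactly when $\langle x - y, Sx - Sy\rangle \ge \frac{1}{2}\|Sx - Sy\|^2$ holds for all $x, y$.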

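The lemma below develops an upper bound for the expected distance between $x^{k+1}$ and any $x^* \in \mathrm{Fix}\,T$.

Lemma 2 Let $(x^k)_{k\ge 0}$ be the sequence generated by Algorithm 1. Then for any $x^* \in \mathrm{Fix}\,T$ and $\gamma > 0$ (to
be optimized later), we have

$$\mathbb{E}\big(\|x^{k+1} - x^*\|^2 \mid \mathcal{X}^k\big) \le \|x^k - x^*\|^2 + \frac{\gamma}{m}\sum_{d\in J(k)}\|x^d - x^{d+1}\|^2 + \frac{1}{m}\Big(\frac{|J(k)|}{\gamma} + \frac{1}{m p_{\min}} - \frac{1}{\eta_k}\Big)\|x^k - \bar{x}^{k+1}\|^2. \tag{31}$$

Proof Recall $\mathrm{Prob}(i_k = i) = p_i$. Then we have

$$\begin{aligned}
\mathbb{E}\big(\|x^{k+1} - x^*\|^2 \mid \mathcal{X}^k\big)
&\overset{(3)}{=} \mathbb{E}\Big(\big\|x^k - \tfrac{\eta_k}{m p_{i_k}} S_{i_k}\hat{x}^k - x^*\big\|^2 \,\Big|\, \mathcal{X}^k\Big) \\
&= \|x^k - x^*\|^2 + \mathbb{E}\Big(\tfrac{2\eta_k}{m p_{i_k}}\langle S_{i_k}\hat{x}^k, x^* - x^k\rangle + \tfrac{\eta_k^2}{m^2 p_{i_k}^2}\|S_{i_k}\hat{x}^k\|^2 \,\Big|\, \mathcal{X}^k\Big) \\
&\overset{(29)}{=} \|x^k - x^*\|^2 + \frac{2\eta_k}{m}\sum_{i=1}^m \langle S_i\hat{x}^k, x^* - x^k\rangle + \frac{\eta_k^2}{m^2}\sum_{i=1}^m \frac{1}{p_i}\|S_i\hat{x}^k\|^2 \\
&= \|x^k - x^*\|^2 + \frac{2\eta_k}{m}\langle S\hat{x}^k, x^* - x^k\rangle + \frac{\eta_k^2}{m^2}\sum_{i=1}^m \frac{1}{p_i}\|S_i\hat{x}^k\|^2.
\end{aligned} \tag{32}$$

Note that

$$\sum_{i=1}^m \frac{1}{p_i}\|S_i\hat{x}^k\|^2 \le \frac{1}{p_{\min}}\sum_{i=1}^m \|S_i\hat{x}^k\|^2 = \frac{1}{p_{\min}}\|S\hat{x}^k\|^2 \overset{(30)}{=} \frac{1}{\eta_k^2 p_{\min}}\|x^k - \bar{x}^{k+1}\|^2, \tag{33}$$

and

$$\begin{aligned}
\langle S\hat{x}^k, x^* - x^k\rangle
&\overset{(5)}{=} \Big\langle S\hat{x}^k,\; x^* - \hat{x}^k + \sum_{d\in J(k)}(x^d - x^{d+1})\Big\rangle \\
&\overset{(30)}{=} \langle S\hat{x}^k, x^* - \hat{x}^k\rangle + \frac{1}{\eta_k}\sum_{d\in J(k)}\langle x^k - \bar{x}^{k+1}, x^d - x^{d+1}\rangle \\
&\le \langle S\hat{x}^k - Sx^*, x^* - \hat{x}^k\rangle + \frac{1}{2\eta_k}\sum_{d\in J(k)}\Big(\frac{1}{\gamma}\|x^k - \bar{x}^{k+1}\|^2 + \gamma\|x^d - x^{d+1}\|^2\Big) \\
&\le -\frac{1}{2}\|S\hat{x}^k\|^2 + \frac{1}{2\eta_k}\sum_{d\in J(k)}\Big(\frac{1}{\gamma}\|x^k - \bar{x}^{k+1}\|^2 + \gamma\|x^d - x^{d+1}\|^2\Big) \\
&\overset{(30)}{=} -\frac{1}{2\eta_k^2}\|x^k - \bar{x}^{k+1}\|^2 + \frac{|J(k)|}{2\gamma\eta_k}\|x^k - \bar{x}^{k+1}\|^2 + \frac{\gamma}{2\eta_k}\sum_{d\in J(k)}\|x^d - x^{d+1}\|^2,
\end{aligned} \tag{34}$$

where the first inequality follows from Young's inequality together with $Sx^* = 0$, and the second one from
the 1/2-cocoercivity of $S$ (Lemma 1). Plugging (33) and (34) into (32) gives the
desired result.


We need the following lemma on nonnegative almost supermartingales [50].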


We need the following lemma on nonnegative almost supermartingales [50]. 
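Lemma 3 ([50, Theorem 1]) Let $\mathscr{F} = (\mathcal{F}_k)_{k\ge 0}$ be a sequence of sub-sigma algebras of $\mathcal{F}$ such that $\forall k \ge 0$,
$\mathcal{F}_k \subset \mathcal{F}_{k+1}$. Define $\ell_+(\mathscr{F})$ as the set of sequences of $[0, +\infty)$-valued random variables $(\xi_k)_{k\ge 0}$, where $\xi_k$ is
$\mathcal{F}_k$ measurable, and $\ell_+^1(\mathscr{F}) := \{(\xi_k)_{k\ge 0} \in \ell_+(\mathscr{F}) \mid \sum_k \xi_k < +\infty \text{ a.s.}\}$. Let $(\alpha_k)_{k\ge 0}, (v_k)_{k\ge 0} \in \ell_+(\mathscr{F})$, and
$(\eta_k)_{k\ge 0}, (\xi_k)_{k\ge 0} \in \ell_+^1(\mathscr{F})$ be such that

$$\mathbb{E}(\alpha_{k+1} \mid \mathcal{F}_k) + v_k \le (1 + \xi_k)\alpha_k + \eta_k.$$

Then $(v_k)_{k\ge 0} \in \ell_+^1(\mathscr{F})$ and $\alpha_k$ converges to a $[0, +\infty)$-valued random variable a.s.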

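Let $\mathcal{H}^{\tau+1} = \prod_{i=0}^{\tau}\mathcal{H}$ be a product space and $\langle\cdot \mid \cdot\rangle$ be the induced inner product:

$$\langle (z^0, \ldots, z^\tau) \mid (y^0, \ldots, y^\tau)\rangle = \sum_{i=0}^{\tau}\langle z^i, y^i\rangle, \quad \forall (z^0, \ldots, z^\tau), (y^0, \ldots, y^\tau) \in \mathcal{H}^{\tau+1}.$$

Let $M'$ be a symmetric $(\tau+1)\times(\tau+1)$ tri-diagonal matrix with its main diagonal as $\sqrt{p_{\min}}\,[\frac{1}{\sqrt{p_{\min}}} + \tau,\; 2\tau - 1,\; 2\tau - 3,\; \ldots,\; 1]$
and first off-diagonal as $-\sqrt{p_{\min}}\,[\tau,\; \tau - 1,\; \ldots,\; 1]$, and let $M = M' \otimes I_{\mathcal{H}}$. Here $\otimes$ represents
the Kronecker product. For a given $(y^0, \ldots, y^\tau) \in \mathcal{H}^{\tau+1}$, $(z^0, \ldots, z^\tau) = M(y^0, \ldots, y^\tau)$ is given by:

$$\begin{aligned}
z^0 &= y^0 + \tau\sqrt{p_{\min}}\,(y^0 - y^1), \\
z^i &= \sqrt{p_{\min}}\,\big((i - \tau - 1)y^{i-1} + (2\tau - 2i + 1)y^i + (i - \tau)y^{i+1}\big), \quad \text{if } 1 \le i \le \tau - 1, \\
z^\tau &= \sqrt{p_{\min}}\,(y^\tau - y^{\tau-1}).
\end{aligned}$$

Then $M$ is a self-adjoint and positive definite linear operator since $M'$ is symmetric and positive definite,
and we define $\langle\cdot \mid \cdot\rangle_M = \langle\cdot \mid M\cdot\rangle$ as the $M$-weighted inner product and $\|\cdot\|_M$ the induced norm. Let

$$\mathbf{x}^k = (x^k, x^{k-1}, \ldots, x^{k-\tau}) \in \mathcal{H}^{\tau+1},\; k \ge 0, \quad\text{and}\quad \mathbf{x}^* = (x^*, x^*, \ldots, x^*) \in \mathbf{X}^* \subseteq \mathcal{H}^{\tau+1},$$

where we set $x^k = x^0$ for $k < 0$. With

$$\xi_k(\mathbf{x}^*) := \|\mathbf{x}^k - \mathbf{x}^*\|_M^2 = \|x^k - x^*\|^2 + \sqrt{p_{\min}}\sum_{i=k-\tau}^{k-1}\big(i - (k - \tau) + 1\big)\|x^i - x^{i+1}\|^2, \tag{35}$$

we have the following fundamental inequality: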

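Theorem 2 (Fundamental inequality) Let $(x^k)_{k\ge 0}$ be the sequence generated by ARock. Then for any
$\mathbf{x}^* \in \mathbf{X}^*$, it holds that

$$\mathbb{E}\big(\xi_{k+1}(\mathbf{x}^*) \mid \mathcal{X}^k\big) + \frac{1}{m}\Big(\frac{1}{\eta_k} - \frac{2\tau}{m\sqrt{p_{\min}}} - \frac{1}{m p_{\min}}\Big)\|\bar{x}^{k+1} - x^k\|^2 \le \xi_k(\mathbf{x}^*). \tag{36}$$

Proof Let $\gamma = m\sqrt{p_{\min}}$. Since $J(k) \subseteq \{k-1, \ldots, k-\tau\}$, then (31) indicates

$$\mathbb{E}\big(\|x^{k+1} - x^*\|^2 \mid \mathcal{X}^k\big) \le \|x^k - x^*\|^2 + \sqrt{p_{\min}}\sum_{i=k-\tau}^{k-1}\|x^i - x^{i+1}\|^2 + \frac{1}{m}\Big(\frac{\tau}{m\sqrt{p_{\min}}} + \frac{1}{m p_{\min}} - \frac{1}{\eta_k}\Big)\|x^k - \bar{x}^{k+1}\|^2. \tag{37}$$

From (3) and (30), it is easy to have $\mathbb{E}(\|x^k - x^{k+1}\|^2 \mid \mathcal{X}^k) \le \frac{1}{m^2 p_{\min}}\|x^k - \bar{x}^{k+1}\|^2$, which together with (37)
implies (36) by using the definition of $\xi_k(\mathbf{x}^*)$.

Remark 1 (Stochastic Fejér monotonicity) From (36), if $0 < \eta_k \le \frac{m p_{\min}}{2\tau\sqrt{p_{\min}}+1}$, then we have
$\mathbb{E}(\|\mathbf{x}^{k+1} - \mathbf{x}^*\|_M^2 \mid \mathcal{X}^k) \le \|\mathbf{x}^k - \mathbf{x}^*\|_M^2$, $\forall \mathbf{x}^* \in \mathbf{X}^*$.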
Remark 2 Let us check our step size bound $\frac{m p_{\min}}{2\tau\sqrt{p_{\min}}+1}$. Consider the uniform case: $p_{\min} \equiv p_i \equiv \frac{1}{m}$. Then,
the bound simplifies to $\frac{1}{1 + 2\tau/\sqrt{m}}$. If the max delay is no more than the square root of the number of coordinates, i.e.,
$\tau = O(\sqrt{m})$, then the bound is $O(1)$. In general, $\tau$ depends on several factors such as problem structure,
system architecture, load balance, etc. If all updates and agents are identical, then $\tau$ is proportional to $p$,
the number of agents. Hence, ARock takes an $O(1)$ step size for solving a problem with $m$ coordinates by
$p = \sqrt{m}$ agents under balanced loads.
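
To get a feel for the numbers in Remark 2, the bound can be evaluated directly. The helper below is a small sketch, assuming the bound has the form $m p_{\min} / (2\tau\sqrt{p_{\min}} + 1)$ from Remark 1; the function name `step_size_bound` is hypothetical.

```python
import math


def step_size_bound(m, p_min, tau):
    """Upper bound m*p_min / (2*tau*sqrt(p_min) + 1) on the step size eta_k.

    m     : number of coordinates
    p_min : smallest selection probability (1/m for uniform sampling)
    tau   : maximum delay
    """
    return m * p_min / (2.0 * tau * math.sqrt(p_min) + 1.0)
```

For uniform sampling with $m = 100$, no delay gives a bound of 1, while $\tau = 5$ (half of $\sqrt{m}$) gives $1/(1 + 10/10) = 0.5$, so the admissible step size shrinks only by a constant factor when $\tau$ is on the order of $\sqrt{m}$.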

The next lemma is a direct consequence of the invertibility of the metric M. 

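Lemma 4 A sequence $(\mathbf{z}^k)_{k\ge 0} \subset \mathcal{H}^{\tau+1}$ (weakly) converges to $\mathbf{z} \in \mathcal{H}^{\tau+1}$ under the metric $\langle\cdot \mid \cdot\rangle$ if and only
if it does so under the metric $\langle\cdot \mid \cdot\rangle_M$.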

In light of Lemma 4, the metric of the inner product for weak convergence in the next lemma is not 
specified. The lemma and its proof are adapted from [15]. 

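Lemma 5 Let $(x^k)_{k\ge 0} \subset \mathcal{H}$ be the sequence generated by ARock with $\eta_k \in \big[\eta_{\min}, \frac{c\, m p_{\min}}{2\tau\sqrt{p_{\min}}+1}\big]$ for any $\eta_{\min} > 0$
and $0 < c < 1$. Then we have:

(i) $\sum_{k=0}^{\infty}\|x^k - \bar{x}^{k+1}\|^2 < \infty$ a.s.

(ii) $x^k - x^{k+1} \to 0$ a.s. and $\hat{x}^k - x^{k+1} \to 0$ a.s.

(iii) The sequence $(\mathbf{x}^k)_{k\ge 0} \subset \mathcal{H}^{\tau+1}$ is bounded a.s.

(iv) There exists $\tilde{\Omega} \in \mathcal{F}$ such that $P(\tilde{\Omega}) = 1$ and, for every $\omega \in \tilde{\Omega}$ and every $\mathbf{x}^* \in \mathbf{X}^*$, $(\|\mathbf{x}^k(\omega) - \mathbf{x}^*\|_M)_{k\ge 0}$
converges.

(v) Let $\mathcal{Z}(\mathbf{x}^k)$ be the set of weakly convergent cluster points of $(\mathbf{x}^k)_{k\ge 0}$. Then, $\mathcal{Z}(\mathbf{x}^k) \subseteq \mathbf{X}^*$ a.s.

Proof (i): Note that $\inf_k \big(\frac{1}{\eta_k} - \frac{2\tau}{m\sqrt{p_{\min}}} - \frac{1}{m p_{\min}}\big) > 0$. Also note that, in (36), $\|\bar{x}^{k+1} - x^k\|^2 = \|\eta_k S\hat{x}^k\|^2$ is
$\mathcal{X}^k$-measurable. Hence, applying Lemma 3 with $\xi_k = \eta_k = 0$ and $\alpha_k = \xi_k(\mathbf{x}^*)$, $\forall k$, to (36) gives this result
directly.

(ii): From (i), we have $x^k - \bar{x}^{k+1} \to 0$ a.s. Since $\|x^k - x^{k+1}\| \le \frac{1}{m p_{\min}}\|x^k - \bar{x}^{k+1}\|$, we have $x^k - x^{k+1} \to 0$
a.s. Then from (5), we have $\hat{x}^k - x^k \to 0$ a.s.

(iii): From Lemma 3, we have that $(\|\mathbf{x}^k - \mathbf{x}^*\|_M^2)_{k\ge 0}$ converges a.s. and so does $(\|\mathbf{x}^k - \mathbf{x}^*\|_M)_{k\ge 0}$, i.e.,
$\lim_{k\to\infty}\|\mathbf{x}^k - \mathbf{x}^*\|_M = \gamma$ a.s., where $\gamma$ is a $[0, +\infty)$-valued random variable. Hence, $(\|\mathbf{x}^k - \mathbf{x}^*\|_M)_{k\ge 0}$ must
be bounded a.s. and so is $(\mathbf{x}^k)_{k\ge 0}$.

(iv): The proof follows directly from [15, Proposition 2.3 (iii)]. It is worth noting that $\tilde{\Omega}$ in the statement
works for all $\mathbf{x}^* \in \mathbf{X}^*$, namely, $\tilde{\Omega}$ does not depend on $\mathbf{x}^*$.

(v): By (ii), there exists $\hat{\Omega} \in \mathcal{F}$ such that $P(\hat{\Omega}) = 1$ and

$$x^k(\omega) - x^{k+1}(\omega) \to 0, \quad \forall \omega \in \hat{\Omega}. \tag{38}$$

For any $\omega \in \hat{\Omega}$, let $(\mathbf{x}^{k_n}(\omega))_{n\ge 0}$ be a weakly convergent subsequence of $(\mathbf{x}^k(\omega))_{k\ge 0}$, i.e., $\mathbf{x}^{k_n}(\omega) \rightharpoonup \mathbf{x}$, where
$\mathbf{x}^{k_n}(\omega) = (x^{k_n}(\omega), x^{k_n-1}(\omega), \ldots, x^{k_n-\tau}(\omega))$ and $\mathbf{x} = (u^0, \ldots, u^\tau)$. Note that $\mathbf{x}^{k_n}(\omega) \rightharpoonup \mathbf{x}$ implies $x^{k_n-j}(\omega) \rightharpoonup
u^j$, $\forall j$. Therefore, $u^i = u^j$ for any $i, j \in \{0, \ldots, \tau\}$ because $x^{k_n-i}(\omega) - x^{k_n-j}(\omega) \to 0$.

Furthermore, observing $\eta_k \ge \eta_{\min} > 0$, we have

$$\lim_{n\to\infty} \hat{x}^{k_n}(\omega) - T\hat{x}^{k_n}(\omega) = \lim_{n\to\infty} S\hat{x}^{k_n}(\omega) = \lim_{n\to\infty} \frac{1}{\eta_{k_n}}\big(x^{k_n}(\omega) - \bar{x}^{k_n+1}(\omega)\big) = 0. \tag{39}$$

From the triangle inequality and the nonexpansiveness of $T$, it follows that

$$\begin{aligned}
\|x^{k_n}(\omega) - Tx^{k_n}(\omega)\|
&= \|x^{k_n}(\omega) - \hat{x}^{k_n}(\omega) + \hat{x}^{k_n}(\omega) - T\hat{x}^{k_n}(\omega) + T\hat{x}^{k_n}(\omega) - Tx^{k_n}(\omega)\| \\
&\le \|x^{k_n}(\omega) - \hat{x}^{k_n}(\omega)\| + \|\hat{x}^{k_n}(\omega) - T\hat{x}^{k_n}(\omega)\| + \|T\hat{x}^{k_n}(\omega) - Tx^{k_n}(\omega)\| \\
&\le 2\|x^{k_n}(\omega) - \hat{x}^{k_n}(\omega)\| + \|\hat{x}^{k_n}(\omega) - T\hat{x}^{k_n}(\omega)\| \\
&\le 2\sum_{d\in J(k_n)}\|x^d(\omega) - x^{d+1}(\omega)\| + \|\hat{x}^{k_n}(\omega) - T\hat{x}^{k_n}(\omega)\|.
\end{aligned}$$

From (38), (39), and the above inequality, it follows $\lim_{n\to\infty} x^{k_n}(\omega) - Tx^{k_n}(\omega) = 0$. Finally, the demiclosedness
principle [6, Theorem 4.17] implies $u^0 \in \mathrm{Fix}\,T$.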

Theorem 3 Under the assumptions of Lemma 5, the sequence $(\mathbf{x}^k)_{k\ge 0}$ weakly converges to an $\mathbf{X}^*$-valued
random variable a.s. In addition, if $T$ is demicompact at 0, $(\mathbf{x}^k)_{k\ge 0}$ strongly converges to an $\mathbf{X}^*$-valued
random variable a.s.


Proof The proof for a.s. weak convergence follows from Opial's Lemma [47,42] and Lemma 5 (iv)-(v). Next we
assume that $T$ is demicompact at 0. From the proof of Lemma 5 (v), there is $\hat{\Omega} \in \mathcal{F}$ such that $P(\hat{\Omega}) = 1$ and,
for any $\omega \in \hat{\Omega}$ and any weakly convergent subsequence $(\mathbf{x}^{k_n}(\omega))_{n\ge 0}$, $\lim_{n\to\infty} x^{k_n}(\omega) - Tx^{k_n}(\omega) = 0$. Since
$T$ is demicompact, $(x^{k_n}(\omega))_{n\ge 0}$ has a strongly convergent subsequence, for which we still use $(x^{k_n}(\omega))_{n\ge 0}$.
Hence, $x^{k_n}(\omega) \to \bar{x}(\omega) \in \mathrm{Fix}\,T$. Lemma 5 (ii) yields $\mathbf{x}^{k_n}(\omega) \to \bar{\mathbf{x}}(\omega) \in \mathbf{X}^*$. Then by Lemma 5 (iv), there
is $\tilde{\Omega} \in \mathcal{F}$ such that $P(\tilde{\Omega}) = 1$ and, for every $\omega \in \tilde{\Omega}$ and every $\mathbf{x}^* \in \mathbf{X}^*$, $(\|\mathbf{x}^k(\omega) - \mathbf{x}^*\|_M)_{k\ge 0}$ converges.
Thus, for any $\omega \in \hat{\Omega} \cap \tilde{\Omega}$, we have $\lim_{k\to\infty}\|\mathbf{x}^k(\omega) - \bar{\mathbf{x}}(\omega)\|_M = 0$. Because $P(\hat{\Omega} \cap \tilde{\Omega}) = 1$, we conclude that
$(\mathbf{x}^k)_{k\ge 0}$ strongly converges to an $\mathbf{X}^*$-valued random variable a.s.


Remark 3 For the generalization in Section 1.3, we need to replace (33) by 


eie £ii Ui o s#w* < ± eie ii u t o sx k r < £-jsx k r = ^-\\x k - 

and update the step size condition to m £ hmin, oE?==T7^1- Then the proofs of Theorem 3 and Lemma 5 

L ^'y/P minTL' J 

will go through and yield the same convergence result. 


3.2 Linear convergence 


In this section, we establish linear convergence under the assumption that S is quasi-strongly monotone. We 
first present a key lemma. 


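Lemma 6 Assume that the step size is fixed, i.e., $\eta_k \equiv \eta$, and satisfies

$$0 < \eta \le \bar{\eta}_1 := \Big(1 - \frac{1}{\rho}\Big)\frac{m\sqrt{p_{\min}}}{8}\cdot\frac{\rho^{1/2} - 1}{\rho^{(\tau+1)/2} - 1} \tag{40}$$

for some $\rho > 1$. Then we have, for all $k \ge 1$,

$$\mathbb{E}\|\bar{x}^k - x^{k-1}\|^2 \le \rho\,\mathbb{E}\|\bar{x}^{k+1} - x^k\|^2. \tag{41}$$

Proof We prove (41) by induction. First, based on the inequality $\|a\|^2 - \|b\|^2 \le 2\|a\|\|b - a\|$ we observe that,
for any $k \ge 1$,

$$\begin{aligned}
\|\bar{x}^k - x^{k-1}\|^2 - \|\bar{x}^{k+1} - x^k\|^2
&\le 2\|\bar{x}^k - x^{k-1}\|\,\|\bar{x}^{k+1} - x^k - \bar{x}^k + x^{k-1}\| \\
&= 2\|\bar{x}^k - x^{k-1}\|\,\|\eta S(\hat{x}^k) - \eta S(\hat{x}^{k-1})\| \\
&\le 4\eta\|\bar{x}^k - x^{k-1}\|\,\|\hat{x}^k - \hat{x}^{k-1}\|.
\end{aligned} \tag{42}$$

Applying the triangle inequality and (5) yields

$$\begin{aligned}
\|\hat{x}^k - \hat{x}^{k-1}\|
&\le \|\hat{x}^k - x^k\| + \|x^k - x^{k-1}\| + \|x^{k-1} - \hat{x}^{k-1}\| \\
&\le \sum_{d\in J(k)}\|x^d - x^{d+1}\| + \|x^k - x^{k-1}\| + \sum_{d\in J(k-1)}\|x^d - x^{d+1}\| \\
&\le 2\sum_{t=0}^{\tau}\|x^{k-t} - x^{k-t-1}\|.
\end{aligned} \tag{43}$$

For the basic case, we have $\hat{x}^0 = x^0$, $\hat{x}^1 \in \{x^0, x^1\}$. Letting $k = 1$ in (42) gets us

$$\begin{aligned}
\mathbb{E}\|\bar{x}^1 - x^0\|^2 - \mathbb{E}\|\bar{x}^2 - x^1\|^2
&\le 4\eta\,\mathbb{E}\|\bar{x}^1 - x^0\|\,\|x^1 - x^0\| \\
&\le 2\eta\Big(\frac{1}{m\sqrt{p_{\min}}}\mathbb{E}\|\bar{x}^1 - x^0\|^2 + m\sqrt{p_{\min}}\,\mathbb{E}\|x^1 - x^0\|^2\Big) \\
&= 2\eta\Big(\frac{1}{m\sqrt{p_{\min}}}\mathbb{E}\|\bar{x}^1 - x^0\|^2 + m\sqrt{p_{\min}}\sum_{i=1}^m p_i\frac{\eta^2}{m^2 p_i^2}\|S_i x^0\|^2\Big) \\
&\le 2\eta\Big(\frac{1}{m\sqrt{p_{\min}}}\mathbb{E}\|\bar{x}^1 - x^0\|^2 + \frac{1}{m\sqrt{p_{\min}}}\mathbb{E}\|\bar{x}^1 - x^0\|^2\Big) \\
&= \frac{4\eta}{m\sqrt{p_{\min}}}\mathbb{E}\|\bar{x}^1 - x^0\|^2.
\end{aligned}$$

Rearranging the above inequality yields $\mathbb{E}\|\bar{x}^1 - x^0\|^2 \le \big(1 - \frac{4\eta}{m\sqrt{p_{\min}}}\big)^{-1}\mathbb{E}\|\bar{x}^2 - x^1\|^2$. By (40) and $\rho > 1$, it holds
that $0 < \eta \le (1 - \frac{1}{\rho})\frac{m\sqrt{p_{\min}}}{8} \le (1 - \frac{1}{\rho})\frac{m\sqrt{p_{\min}}}{4}$. Hence, $\mathbb{E}\|\bar{x}^1 - x^0\|^2 \le \rho\,\mathbb{E}\|\bar{x}^2 - x^1\|^2$.

For the induction step, applying Young's inequality gives us

$$\begin{aligned}
\mathbb{E}\|\bar{x}^k - x^{k-1}\|\,\|x^{k-t} - x^{k-t-1}\|
&\le \frac{1}{2}\mathbb{E}\Big\{a\|x^{k-t} - x^{k-t-1}\|^2 + \frac{1}{a}\|\bar{x}^k - x^{k-1}\|^2\Big\} \\
&\le \frac{1}{2}\mathbb{E}\Big\{\frac{a}{m^2 p_{\min}}\|\bar{x}^{k-t} - x^{k-t-1}\|^2 + \frac{1}{a}\|\bar{x}^k - x^{k-1}\|^2\Big\} \\
&\le \frac{1}{2}\Big\{\frac{a\rho^t}{m^2 p_{\min}} + \frac{1}{a}\Big\}\mathbb{E}\|\bar{x}^k - x^{k-1}\|^2 \\
&= \frac{\rho^{t/2}}{m\sqrt{p_{\min}}}\mathbb{E}\|\bar{x}^k - x^{k-1}\|^2 \quad (\text{letting } a = m\sqrt{p_{\min}}\,\rho^{-t/2}),
\end{aligned}$$

where the third inequality uses the induction hypothesis. Combining (42) with (43), taking the expectation, and applying the bound above yield

$$\begin{aligned}
\mathbb{E}\|\bar{x}^k - x^{k-1}\|^2 - \mathbb{E}\|\bar{x}^{k+1} - x^k\|^2
&\le 8\eta\sum_{t=0}^{\tau}\mathbb{E}\|\bar{x}^k - x^{k-1}\|\,\|x^{k-t} - x^{k-t-1}\| \\
&\le \frac{8\eta}{m\sqrt{p_{\min}}}\sum_{t=0}^{\tau}\rho^{t/2}\,\mathbb{E}\|\bar{x}^k - x^{k-1}\|^2 \\
&= \frac{8\eta}{m\sqrt{p_{\min}}}\cdot\frac{\rho^{(\tau+1)/2} - 1}{\rho^{1/2} - 1}\,\mathbb{E}\|\bar{x}^k - x^{k-1}\|^2.
\end{aligned}$$

Finally, rearranging the above inequality and using (40) lead to $\mathbb{E}\|\bar{x}^k - x^{k-1}\|^2 \le \rho\,\mathbb{E}\|\bar{x}^{k+1} - x^k\|^2$. This
completes the proof.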


With this lemma, we are ready to derive the linear convergence rate of ARock. 
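Theorem 4 (Linear convergence) Assume that $S$ is quasi-$\mu$-strongly monotone with $\mu > 0$. Let $\beta \in (0, 1)$
and $(x^k)_{k\ge 0}$ be the sequence generated by ARock with a constant stepsize $\eta \in (0, \min\{\bar{\eta}_1, \bar{\eta}_2\}]$, where $\bar{\eta}_1$ is
given in (40) and

$$\bar{\eta}_2 = \frac{-b + \sqrt{b^2 + 4(1-\beta)a}}{2a}, \quad a = \frac{2\beta\mu\tau}{m^2 p_{\min}}\cdot\frac{\rho(\rho^\tau - 1)}{\rho - 1}, \quad b = \frac{1}{m p_{\min}} + \frac{2}{m}\sqrt{\frac{\rho(\rho^\tau - 1)\tau}{(\rho - 1)p_{\min}}}. \tag{44}$$

Then

$$\mathbb{E}\big(\|x^k - x^*\|^2\big) \le \Big(1 - \frac{\beta\mu\eta}{m}\Big)^k\|x^0 - x^*\|^2. \tag{45}$$

Proof Following the proof of Lemma 2 and starting from (34), we have

$$\begin{aligned}
\langle S\hat{x}^k, x^* - x^k\rangle
&\le \langle S\hat{x}^k - Sx^*, x^* - \hat{x}^k\rangle + \frac{1}{2\eta}\sum_{d\in J(k)}\Big(\frac{1}{\gamma}\|x^k - \bar{x}^{k+1}\|^2 + \gamma\|x^d - x^{d+1}\|^2\Big) \\
&\le -\beta\mu\|\hat{x}^k - x^*\|^2 - \frac{1-\beta}{2}\|S\hat{x}^k\|^2 + \frac{1}{2\eta}\sum_{d\in J(k)}\Big(\frac{1}{\gamma}\|x^k - \bar{x}^{k+1}\|^2 + \gamma\|x^d - x^{d+1}\|^2\Big) \\
&= -\beta\mu\Big\|x^k - x^* + \sum_{d\in J(k)}(x^d - x^{d+1})\Big\|^2 - \frac{1-\beta}{2\eta^2}\|x^k - \bar{x}^{k+1}\|^2 \\
&\qquad + \frac{|J(k)|}{2\gamma\eta}\|x^k - \bar{x}^{k+1}\|^2 + \frac{\gamma}{2\eta}\sum_{d\in J(k)}\|x^d - x^{d+1}\|^2 \\
&\le -\frac{\beta\mu}{2}\|x^k - x^*\|^2 + \beta\mu\Big\|\sum_{d\in J(k)}(x^d - x^{d+1})\Big\|^2 - \frac{1-\beta}{2\eta^2}\|x^k - \bar{x}^{k+1}\|^2 \\
&\qquad + \frac{|J(k)|}{2\gamma\eta}\|x^k - \bar{x}^{k+1}\|^2 + \frac{\gamma}{2\eta}\sum_{d\in J(k)}\|x^d - x^{d+1}\|^2 \\
&\le -\frac{\beta\mu}{2}\|x^k - x^*\|^2 + \beta\mu|J(k)|\sum_{d\in J(k)}\|x^d - x^{d+1}\|^2 - \frac{1-\beta}{2\eta^2}\|x^k - \bar{x}^{k+1}\|^2 \\
&\qquad + \frac{|J(k)|}{2\gamma\eta}\|x^k - \bar{x}^{k+1}\|^2 + \frac{\gamma}{2\eta}\sum_{d\in J(k)}\|x^d - x^{d+1}\|^2,
\end{aligned}$$

where the second inequality holds because $S$ is $\frac{1}{2}$-cocoercive and also quasi-$\mu$-strongly monotone, the third
uses $\|a + b\|^2 \ge \frac{1}{2}\|a\|^2 - \|b\|^2$, and the
last one comes from the Cauchy-Schwarz inequality. Plugging the above inequality and (33) into (32) and
noting $J(k) \subseteq \{k - \tau, \ldots, k - 1\}$ gives

$$\begin{aligned}
\mathbb{E}\big(\|x^{k+1} - x^*\|^2 \mid \mathcal{X}^k\big)
&\le \Big(1 - \frac{\beta\mu\eta}{m}\Big)\|x^k - x^*\|^2 + \frac{1}{m}(2\beta\eta\mu\tau + \gamma)\sum_{d=k-\tau}^{k-1}\|x^d - x^{d+1}\|^2 \\
&\qquad + \frac{1}{m}\Big(\frac{\tau}{\gamma} + \frac{1}{m p_{\min}} - \frac{1-\beta}{\eta}\Big)\|x^k - \bar{x}^{k+1}\|^2.
\end{aligned}$$

Taking expectation over both sides of the above inequality, noting $\mathbb{E}\|x^d - x^{d+1}\|^2 \le \frac{1}{m^2 p_{\min}}\mathbb{E}\|x^d - \bar{x}^{d+1}\|^2$,
and using Lemma 6, we have

$$\begin{aligned}
\mathbb{E}\big(\|x^{k+1} - x^*\|^2\big)
&\le \Big(1 - \frac{\beta\mu\eta}{m}\Big)\mathbb{E}\|x^k - x^*\|^2 + \frac{1}{m^3 p_{\min}}(2\beta\eta\mu\tau + \gamma)\sum_{d=k-\tau}^{k-1}\rho^{k-d}\,\mathbb{E}\|x^k - \bar{x}^{k+1}\|^2 \\
&\qquad + \frac{1}{m}\Big(\frac{\tau}{\gamma} + \frac{1}{m p_{\min}} - \frac{1-\beta}{\eta}\Big)\mathbb{E}\|x^k - \bar{x}^{k+1}\|^2 \\
&= \Big(1 - \frac{\beta\mu\eta}{m}\Big)\mathbb{E}\|x^k - x^*\|^2 + \frac{1}{m^3 p_{\min}}(2\beta\eta\mu\tau + \gamma)\frac{\rho(\rho^\tau - 1)}{\rho - 1}\,\mathbb{E}\|x^k - \bar{x}^{k+1}\|^2 \\
&\qquad + \frac{1}{m}\Big(\frac{\tau}{\gamma} + \frac{1}{m p_{\min}} - \frac{1-\beta}{\eta}\Big)\mathbb{E}\|x^k - \bar{x}^{k+1}\|^2 \\
&= \Big(1 - \frac{\beta\mu\eta}{m}\Big)\mathbb{E}\|x^k - x^*\|^2 \\
&\qquad + \frac{1}{m}\Big(\frac{2\beta\eta\mu\tau}{m^2 p_{\min}}\cdot\frac{\rho(\rho^\tau - 1)}{\rho - 1} + \frac{2}{m}\sqrt{\frac{\rho(\rho^\tau - 1)\tau}{(\rho - 1)p_{\min}}} + \frac{1}{m p_{\min}} - \frac{1-\beta}{\eta}\Big)\mathbb{E}\|x^k - \bar{x}^{k+1}\|^2 \\
&\le \Big(1 - \frac{\beta\mu\eta}{m}\Big)\mathbb{E}\|x^k - x^*\|^2,
\end{aligned}$$

where we have let $\gamma = m\sqrt{\frac{\tau p_{\min}(\rho - 1)}{\rho(\rho^\tau - 1)}}$ in the second equality, and the last inequality holds because of the
choice of $\eta$. Therefore, (45) holds.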

Remark 4 Assume $i_k$ is chosen uniformly at random, so $p_{\min} = \frac{1}{m}$. We consider the case when $m$ and $\tau$
are large. Let $\sqrt{\rho} = 1 + \frac{1}{\tau}$. Then from the fact that $(1 + \frac{1}{k})^k$ increasingly converges to the natural number
$e$, we have from (40) that $\bar{\eta}_1 = O(\frac{\sqrt{m}}{\tau^2})$. In addition, note from (44) that $a = O(b^2) = O(\frac{\tau^2}{m})$, and thus
$\bar{\eta}_2 = O(\frac{\sqrt{m}}{\tau})$. Therefore, if $\tau = O(m^{1/4})$, then the stepsize in Theorem 4 can be $\eta = O(1)$. Hence, linear
speedup can be achieved.
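
The two step-size caps of Theorem 4 are easy to tabulate numerically. The sketch below assumes that (40) reads $\bar{\eta}_1 = (1 - 1/\rho)\frac{m\sqrt{p_{\min}}}{8}\cdot\frac{\rho^{1/2}-1}{\rho^{(\tau+1)/2}-1}$ and that (44) defines $\bar{\eta}_2$ as the positive root of $a\eta^2 + b\eta - (1-\beta) = 0$; the function names are hypothetical, and the $\tau = 0$ branch (where $a = 0$) is handled as the limiting linear equation, which is an assumption of this example.

```python
import math


def eta_bar_1(m, p_min, tau, rho):
    """Cap from (40), as assumed above; requires rho > 1."""
    geo = (rho ** ((tau + 1) / 2.0) - 1.0) / (math.sqrt(rho) - 1.0)
    return (1.0 - 1.0 / rho) * m * math.sqrt(p_min) / (8.0 * geo)


def eta_bar_2(m, p_min, tau, rho, mu, beta):
    """Cap from (44), as assumed above: positive root of a*eta^2 + b*eta = 1 - beta."""
    geo = rho * (rho ** tau - 1.0) / (rho - 1.0)        # sum of rho^j for j = 1..tau
    a = 2.0 * beta * mu * tau * geo / (m * m * p_min)
    b = 1.0 / (m * p_min) + (2.0 / m) * math.sqrt(geo * tau / p_min)
    if a == 0.0:                                        # tau = 0: linear equation
        return (1.0 - beta) / b
    return (-b + math.sqrt(b * b + 4.0 * (1.0 - beta) * a)) / (2.0 * a)
```

A constant step size would then be chosen as `min(eta_bar_1(...), eta_bar_2(...))` for some trial value of `rho`; scanning a few values of `rho` shows how quickly both caps shrink as `tau` grows.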


4 Experiments 

We illustrate the behavior of ARock for solving the $\ell_1$ regularized logistic regression problem. Our primary
goal is to show the efficiency of the async-parallel implementation compared to the single-threaded
implementation and the sync-parallel implementation.

Our experiments run on 1 to 32 threads on a machine with eight Quad-Core AMD Opteron™ Processors 
(32 cores in total) and 64 Gigabytes of RAM. All of the experiments were coded in C++ and OpenMP. We use 
the Eigen library 5 for sparse matrix operations. Our codes as well as numerical results for other applications 
will be publicly available on the authors’ website. 

The running times and speedup ratios of both sync-parallel and async-parallel algorithms are sensitive to 
a number of factors, such as the size of each coordinate update (granularity), sparsity of the problem data, 
compiler optimization flags, and operations that affect cache performance and memory access contention. 
In addition, since all agents in the sync-parallel implementation must wait for the last agent to finish an 
iteration, a large load imbalance will significantly degrade the performance. We do not have the space in this 
paper to present numerical results under all variations of these cases. 


4.1 $\ell_1$ regularized logistic regression


In this subsection, we apply ARock with the update (12) to the $\ell_1$ regularized logistic regression problem:

$$\underset{x\in\mathbb{R}^n}{\text{minimize}}\;\; \lambda\|x\|_1 + \frac{1}{N}\sum_{i=1}^N \log\big(1 + \exp(-b_i\cdot a_i^T x)\big), \tag{46}$$

where $\{(a_i, b_i)\}_{i=1}^N$ is the set of sample-label pairs with $b_i \in \{1, -1\}$, $\lambda = 0.0001$, and $n$ and $N$ represent
the numbers of features and samples, respectively. This test uses the datasets (footnote 6) rcv1 and news20, which are
summarized in Table 1.


Name      # samples    # features    # nonzeros in $\{a_1, \ldots, a_N\}$
rcv1      20,242       47,236        1,498,952
news20    19,996       1,355,191     9,097,916

Table 1: Two datasets for sparse logistic regression.

5 http://eigen.tuxfamily.org

6 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
[Figure 3: two scatter plots, (a) dataset rcv1 and (b) dataset news20; horizontal axis: coordinate (each has ~50 features).]

Fig. 3: The distribution of coordinate sparsity. Each dot represents the total number of nonzeros in the
vectors $a_i$ that correspond to each coordinate. The wide distribution in (b) is responsible for the large load
imbalance and thus the poor sync-parallel performance.


We let each coordinate hold roughly 50 features. Since the total number of features is not divisible by 
50, some coordinates have 51 features. We let each agent draw a coordinate uniformly at random at each 
iteration. We stop all the tests after 100 epochs since they have nearly identical progress per iteration. The
step size is set to $\eta_k = 0.9$, $\forall k$. Let $A = [a_1, \ldots, a_N]^T$ and $b = [b_1, \ldots, b_N]^T$. In global memory, we store $A$, $b$,
and $x$. We also store the product $Ax$ in global memory so that the forward step can be efficiently computed.
Whenever a coordinate of x gets updated, Ax is immediately updated at a low cost. Note that if Ax is not 
stored in global memory, every coordinate update will have to compute Ax from scratch, which involves the 
entire x and will be very expensive. 
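
The caching of $Ax$ described above can be sketched as follows. This is a single-threaded illustration with one feature per coordinate, not the authors' C++/OpenMP code; the column-wise sparse layout `cols` and the function names `partial_grad` and `update_coordinate` are assumptions made for the example.

```python
import math


def partial_grad(cols, b, Ax, j):
    """j-th partial derivative of (1/N) * sum_i log(1 + exp(-b_i * a_i^T x)).

    cols[j] lists the nonzeros of column j of A as (row, value) pairs;
    Ax is the cached product, so only the rows touching feature j are read.
    """
    g = 0.0
    for i, a_ij in cols[j]:
        g += -b[i] * a_ij / (1.0 + math.exp(b[i] * Ax[i]))
    return g / len(b)


def update_coordinate(cols, x, Ax, j, new_xj):
    """Write x_j and patch the cached Ax in O(nnz(column j)) operations."""
    delta = new_xj - x[j]
    for i, a_ij in cols[j]:
        Ax[i] += a_ij * delta
    x[j] = new_xj
```

Both routines touch only the nonzeros of one column, so the cost of a coordinate update scales with that column's sparsity rather than with the size of $x$. In a real shared-memory run the increments to `Ax` would also have to be atomic, which this sketch does not model.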

Table 2 gives the running times of the sync-parallel and ARock (async-parallel) implementations on the 
two datasets. We can observe that ARock achieves almost-linear speedup, but sync-parallel scales very poorly 
as we explain below. 

In the sync-parallel implementation, all the running cores have to wait for the last core to finish an
iteration, and therefore if a core has a large load, it slows down the iteration. Although every core is
(randomly) assigned to roughly the same number of features (either 50 or 51 components of $x$) at each
iteration, their $a_j$'s have very different numbers of nonzeros (see Figure 3 for the distribution), and the core
with the largest number of nonzeros is the slowest. (Sparse matrix computation is used for both datasets,
which are very large.) As more cores are used, although together they do more work at each iteration, the
per-iteration time reduces as the slowest core tends to be slower. The very large imbalance of load explains
why the 32 cores only give speedup ratios of 4.0 and 1.3 in Table 2.

On the other hand, being asynchronous, ARock does not suffer from the load imbalance. Its performance
grows nearly linearly with the number of cores. In theory, a large load imbalance may cause a large $\tau$, and thus
a small $\eta_k$. However, the uniform $\eta_k = 0.9$ works well in all the tests, possibly because the $a_i$'s are sparse.

Finally, we have observed that the progress toward solving (46) is mainly a function of the number of 
epochs and does not change appreciably when the number of cores increases or between sync-parallel and 
async-parallel. Therefore, we always stop at 100 epochs. 
                     rcv1                              news20
           Time (s)        Speedup         Time (s)        Speedup
# cores    async   sync    async   sync    async   sync    async   sync
---------------------------------------------------------------------------
1          122.0   122.0   1.0     1.0     591.1   591.3   1.0     1.0
2          63.4    104.1   1.9     1.2     304.2   590.1   1.9     1.0
4          32.7    83.7    3.7     1.5     150.4   557.0   3.9     1.1
8          16.8    63.4    7.3     1.9     78.3    525.1   7.5     1.1
16         9.1     45.4    13.5    2.7     41.6    493.2   14.2    1.2
32         4.9     30.3    24.6    4.0     22.6    455.2   26.1    1.3

Table 2: Running times of ARock (async-parallel) and sync-parallel FBS implementations for the $\ell_1$-regularized 
logistic regression on two datasets. Sync-parallel has a very poor speedup due to the wide distribution 
of coordinate sparsity (Figure 3) and thus the large load imbalance across cores. 
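
The speedup columns of Table 2 are the single-core time divided by the p-core time. The short sketch below recomputes them from the reported times and adds the parallel efficiency (speedup divided by the number of cores). The function names are ours, and because the reported times are rounded to one decimal, a recomputed speedup may differ from the table in the last digit.

```python
cores = [1, 2, 4, 8, 16, 32]
times = {  # running times in seconds, copied from Table 2
    "rcv1":   {"async": [122.0, 63.4, 32.7, 16.8, 9.1, 4.9],
               "sync":  [122.0, 104.1, 83.7, 63.4, 45.4, 30.3]},
    "news20": {"async": [591.1, 304.2, 150.4, 78.3, 41.6, 22.6],
               "sync":  [591.3, 590.1, 557.0, 525.1, 493.2, 455.2]},
}

def speedups(ts):
    """Speedup relative to the single-core time ts[0]."""
    return [ts[0] / t for t in ts]

def efficiency(ts):
    """Parallel efficiency: speedup divided by the number of cores."""
    return [s / p for s, p in zip(speedups(ts), cores)]

for name, runs in times.items():
    for mode, ts in runs.items():
        print(name, mode, [round(s, 1) for s in speedups(ts)])
```

On 32 cores the async efficiency stays above 0.75 on both datasets, while the sync efficiency drops below 0.15.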

5 Conclusion 

We have proposed an async-parallel framework, ARock, for finding a fixed point of a nonexpansive operator 
by coordinate updates. We establish almost sure weak and strong convergence, a linear convergence rate, 
and almost-linear speedup of ARock under certain assumptions. Preliminary numerical results on real data 
illustrate the high efficiency of the proposed framework compared to traditional parallel (sync-parallel) 
algorithms. 
A Derivation of certain updates 

We show in detail how to obtain the updates in (16) and (21). 


A.1 Derivation of updates in (16) 

Let $x = (x_1, \ldots, x_m) \in \mathcal{H}^m$,

$$f(x) := \sum_{i=1}^{m} \iota_{C_i}(x_i),$$

where $g(x)$ equals $0$ if $x_1 = \cdots = x_m$ and $\infty$ otherwise. Then (15a) reduces to

$$\begin{aligned}
x^k &= \arg\min_{z \in \mathcal{H}^m} \; g(z) + \frac{1}{2\gamma}\|z - z^k\|^2
     = \arg\min_{z \in \mathcal{H}^m:\, z_1 = \cdots = z_m} \|z - z^k\|^2 \\
    &= \arg\min_{z \in \mathcal{H}^m:\, z_1 = \cdots = z_m} \sum_{i=1}^{m} \|z_1 - z_i^k\|^2
     = \Big(\frac{1}{m}\sum_{i=1}^{m} z_i^k, \ldots, \frac{1}{m}\sum_{i=1}^{m} z_i^k\Big) \in \mathcal{H}^m,
\end{aligned}$$

where the last equality is obtained by noting that $z_1 = \frac{1}{m}\sum_{i=1}^{m} z_i^k$ is the unique minimizer of $\sum_{i=1}^{m}\|z_1 - z_i^k\|^2$. Next, (15b) reduces to

$$y^k = \arg\min_{z} \; f(z) + \frac{1}{2\gamma}\|z - (2x^k - z^k)\|^2
     = \arg\min_{z:\, z_i \in C_i,\, \forall i} \sum_{i=1}^{m} \|z_i - (2x_i^k - z_i^k)\|^2.$$

It is easy to see that $y_i^k = \mathrm{Proj}_{C_i}(2x_i^k - z_i^k)$, $\forall i$.

Since (15c) only updates the $i_k$-th coordinate of $z$, we only need $x_{i_k}^k$ and $y_{i_k}^k$, and thus in (16a) and (16b), we only compute $x_{i_k}^k$ and $y_{i_k}^k$. Plugging the above $x^k$ and $y^k$ into (15c) gives (16c) directly. 
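
The two closed forms derived above (an average for the x-update and a projection for the y-update) can be tried on a small feasibility problem. The sketch below is illustrative only: it is serial (no delays), the sets C_i are intervals on the real line chosen by us, and the relaxation step z_i ← z_i + 2η(y_i − x_i) is the usual Peaceman-Rachford form, which we assume matches (15c); the exact constant should be checked against (15c) and (16c).

```python
import numpy as np

def project_interval(v, lo, hi):
    """Projection of the scalar v onto the interval [lo, hi]."""
    return min(max(v, lo), hi)

def feasibility_coordinate_updates(intervals, z0, eta=0.5, iters=5000, seed=0):
    """Random coordinate updates for finding a point in the intersection."""
    rng = np.random.default_rng(seed)
    z = np.array(z0, dtype=float)
    m = len(intervals)
    for _ in range(iters):
        i = rng.integers(m)               # coordinate chosen uniformly at random
        x_i = z.mean()                    # x-update: every block equals the average
        lo, hi = intervals[i]
        y_i = project_interval(2.0 * x_i - z[i], lo, hi)   # y-update: projection
        z[i] += 2.0 * eta * (y_i - x_i)   # assumed form of the relaxation step
    return z.mean(), z

# Hypothetical sets: C_1 = [0, 2], C_2 = [1, 3], C_3 = [1.5, 4];
# their intersection is [1.5, 2].
intervals = [(0.0, 2.0), (1.0, 3.0), (1.5, 4.0)]
x_final, z_final = feasibility_coordinate_updates(intervals, [-5.0, 10.0, 0.0])
```

Each step needs only the average of z and one projection, which is what makes the update cheap per coordinate.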


A.2 Derivation of (21) 

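We first show how to get (18). The Lagrangian of (17) is $L(x,y,w) = f(x) + g(y) - \langle w, Ax + By - b\rangle$, and the Lagrange dual function is

$$\begin{aligned}
d(w) &= \min_{x \in \mathcal{H}_1,\, y \in \mathcal{H}_2} L(x,y,w) \\
     &= \Big(\min_{x} f(x) - \langle A^*w, x\rangle\Big) + \Big(\min_{y} g(y) - \langle B^*w, y\rangle\Big) + \langle w, b\rangle \\
     &= -\Big(\max_{x} -f(x) + \langle A^*w, x\rangle\Big) - \Big(\max_{y} -g(y) + \langle B^*w, y\rangle\Big) + \langle w, b\rangle \\
     &= -f^*(A^*w) - g^*(B^*w) + \langle w, b\rangle,
\end{aligned}$$

where the last equality is from the definition of convex conjugate: $f^*(z) = \max_x \langle z, x\rangle - f(x)$. Hence, the dual problem is $\max_w d(w)$, which is equivalent to (18).

Secondly, we show why $z^+ = \mathrm{prox}_{\gamma d_g}(z)$ is given by (20). Note

$$\begin{aligned}
\min_s \; d_g(s) + \frac{1}{2\gamma}\|s - z\|^2
  &= \min_s \; g^*(B^*s) - \langle s, b\rangle + \frac{1}{2\gamma}\|s - z\|^2 \\
  &= \min_s \max_y \; \langle B^*s, y\rangle - g(y) - \langle s, b\rangle + \frac{1}{2\gamma}\|s - z\|^2 \\
  &= \max_y \min_s \; \langle B^*s, y\rangle - g(y) - \langle s, b\rangle + \frac{1}{2\gamma}\|s - z\|^2 \\
  &= \max_y \min_s \; \langle s, By - b\rangle - g(y) + \frac{1}{2\gamma}\|s - z\|^2 \\
  &= \max_y \; -g(y) + \langle z, By - b\rangle - \frac{\gamma}{2}\|By - b\|^2 \\
  &= -\min_y \; g(y) - \langle z, By - b\rangle + \frac{\gamma}{2}\|By - b\|^2,
\end{aligned}$$

where the fifth equality holds because $s^* = z - \gamma(By - b) = \arg\min_s \langle s, By - b\rangle + \frac{1}{2\gamma}\|s - z\|^2$. Hence, by the definition of the 
proximal operator and the above arguments, we have that $z^+ = \mathrm{prox}_{\gamma d_g}(z)$ can be obtained from (20). Then (19) is from (20) 
through replacing $g$ by $f$, $B$ by $A$, and $b$ by $0$. 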
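
The identity behind (20) can be checked numerically in a case where both sides have closed forms. The sketch below assumes g(y) = ½‖y‖² (so that g* = g and d_g(s) = ½‖B*s‖² − ⟨s, b⟩) and uses random finite-dimensional data; these choices, and all variable names, are ours and serve only as a sanity check of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((3, 4))   # hypothetical B, mapping y in R^4 to R^3
b = rng.standard_normal(3)
z = rng.standard_normal(3)
gamma = 0.7

# Route 1: minimize d_g(s) + ||s - z||^2 / (2*gamma) directly.
# Optimality condition: (B B^T + I/gamma) s = b + z/gamma.
s_direct = np.linalg.solve(B @ B.T + np.eye(3) / gamma, b + z / gamma)

# Route 2: solve the y-subproblem
#   min_y g(y) - <z, By - b> + (gamma/2) ||By - b||^2,
# whose optimality condition is (I + gamma B^T B) y = B^T (z + gamma b),
# then recover s = z - gamma (By - b).
y = np.linalg.solve(np.eye(4) + gamma * B.T @ B, B.T @ (z + gamma * b))
s_via_y = z - gamma * (B @ y - b)
```

The two routes agree up to floating-point error, as the chain of equalities predicts.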

Finally, it is straightforward to obtain (21) by plugging (19) and (20) into (15). 
B Derivation of async-parallel ADMM for decentralized optimization 

This section describes how to implement the updates (21) for the model (27). 

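In (27), $g(y)$ and $b$ vanish and, corresponding to the two constraints $x_i = y_{ij}$ and $x_j = y_{ij}$, the two rows of matrices $A$ and $B$ are

$$\begin{bmatrix} \cdots & 1 & \cdots & & \cdots \\ \cdots & & \cdots & 1 & \cdots \end{bmatrix}
\quad\text{and}\quad
\begin{bmatrix} \cdots & -1 & \cdots \\ \cdots & -1 & \cdots \end{bmatrix},$$

where $\cdots$ are zeros, the two coefficients $1$ correspond to $x_i$ and $x_j$, and the two coefficients $-1$ correspond to $y_{ij}$. Then, (21a) and (21b) can be calculated as

$$\begin{aligned}
y_{li}^k &= (z_{li,l}^k + z_{li,i}^k)/(2\gamma) && \forall l \in L(i), \\
(w_g^k)_{li,i} &= (z_{li,i}^k - z_{li,l}^k)/2 && \forall l \in L(i), \\
y_{ir}^k &= (z_{ir,i}^k + z_{ir,r}^k)/(2\gamma) && \forall r \in R(i), \\
(w_g^k)_{ir,i} &= (z_{ir,i}^k - z_{ir,r}^k)/2 && \forall r \in R(i).
\end{aligned}$$

In addition, $x_i^k$ can be obtained by solving (28a), and both $z_{li,i}^{k+1}$ and $z_{ir,i}^{k+1}$ can be updated from (28b) and (28c).

Furthermore, as mentioned in Section 2.6.2, we can derive another version of async-parallel ADMM for decentralized 
optimization, which reduces to the algorithm in [56], by activating an edge $(i,j) \in E$ instead of an agent $i$ each time. In 
this version, the agents $i$ and $j$ associated with the edge $(i,j)$ must also be activated. Here we derive the update (21) for the 
model (27) with the update order of $x$ and $y$ swapped. Following (21) we obtain the following steps whenever an edge $(i,j) \in E$ 
is activated:

$$\begin{aligned}
x_i^k &= \arg\min_{x_i} \; f_i(x_i) - \Big(\sum_{l \in L(i)} z_{li,i}^k + \sum_{r \in R(i)} z_{ir,i}^k\Big) x_i + \frac{\gamma}{2}|E(i)| \cdot \|x_i\|^2, \\
x_j^k &= \arg\min_{x_j} \; f_j(x_j) - \Big(\sum_{l \in L(j)} z_{lj,j}^k + \sum_{r \in R(j)} z_{jr,j}^k\Big) x_j + \frac{\gamma}{2}|E(j)| \cdot \|x_j\|^2, \\
(w_f^k)_{ij,i} &= z_{ij,i}^k - \gamma x_i^k, \\
(w_f^k)_{ij,j} &= z_{ij,j}^k - \gamma x_j^k, \\
y_{ij}^k &= \arg\min_{y_{ij}} \; \big\langle 2(w_f^k)_{ij,i} - z_{ij,i}^k + 2(w_f^k)_{ij,j} - z_{ij,j}^k, \; y_{ij} \big\rangle + \gamma \|y_{ij}\|^2, \\
(w_g^k)_{ij,i} &= 2(w_f^k)_{ij,i} - z_{ij,i}^k + \gamma y_{ij}^k, \\
(w_g^k)_{ij,j} &= 2(w_f^k)_{ij,j} - z_{ij,j}^k + \gamma y_{ij}^k, \\
z_{ij,i}^{k+1} &= z_{ij,i}^k + \eta_k\big((w_g^k)_{ij,i} - (w_f^k)_{ij,i}\big), \\
z_{ij,j}^{k+1} &= z_{ij,j}^k + \eta_k\big((w_g^k)_{ij,j} - (w_f^k)_{ij,j}\big).
\end{aligned}$$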

Every agent $i$ in the network maintains the dual variables $z_{li,i}$, $l \in L(i)$, and $z_{ir,i}$, $r \in R(i)$, and the variables $x, y, w$ are 
intermediate and do not need to be maintained between the activations. When an edge $(i,j)$ is activated, the agents $i$ and $j$ 
first compute their $x_i^k$ and $(w_f^k)_{ij,i}$, and $x_j^k$ and $(w_f^k)_{ij,j}$, independently and respectively, then they collaboratively compute $y_{ij}^k$, and 
finally they update their own $z_{ij,i}$ and $z_{ij,j}$, respectively. We allow adjacent edges (which share agents) to be activated in a 
short period of time when their updates are possibly overlapped in time. When $\tau = 0$, i.e., there is no simultaneous activation 
or overlap, it reduces to the algorithm in [56]. 
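
The edge-activated procedure can be sketched in a few lines for a toy consensus problem. This is a serial illustration under our own assumptions, not the authors' code: each agent holds a scalar x_i with f_i(x) = ½(x − a_i)² (so the x-subproblem has the closed form (a_i + Σz)/(1 + γ|E(i)|)), the linear y-subproblem is solved as y = −c/(2γ), the step has the form z ← z + η(w_g − w_f), edges are drawn uniformly at random, and there is no delay. All names are hypothetical.

```python
import numpy as np

def edge_activated_consensus(a, edges, gamma=1.0, eta=0.5, iters=20000, seed=0):
    """Randomly activate edges; return each agent's local estimate x_i."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    n = len(a)
    deg = np.zeros(n)                  # |E(i)|: number of edges at agent i
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    z = np.zeros((len(edges), 2))      # z[e, 0] = z_{ij,i}, z[e, 1] = z_{ij,j}
    zsum = np.zeros(n)                 # sum of the dual variables kept by each agent
    for _ in range(iters):
        e = rng.integers(len(edges))
        i, j = edges[e]
        # Local x-updates (closed form for the assumed quadratic f_i).
        x_i = (a[i] + zsum[i]) / (1.0 + gamma * deg[i])
        x_j = (a[j] + zsum[j]) / (1.0 + gamma * deg[j])
        wf_i = z[e, 0] - gamma * x_i
        wf_j = z[e, 1] - gamma * x_j
        # Joint y-update: minimize <c, y> + gamma * y^2 over the scalar y.
        c = (2.0 * wf_i - z[e, 0]) + (2.0 * wf_j - z[e, 1])
        y = -c / (2.0 * gamma)
        wg_i = 2.0 * wf_i - z[e, 0] + gamma * y
        wg_j = 2.0 * wf_j - z[e, 1] + gamma * y
        # Relaxed dual updates; only the two variables on this edge change.
        d_i = eta * (wg_i - wf_i)
        d_j = eta * (wg_j - wf_j)
        z[e, 0] += d_i
        z[e, 1] += d_j
        zsum[i] += d_i
        zsum[j] += d_j
    return (a + zsum) / (1.0 + gamma * deg)

a = np.array([0.0, 2.0, 4.0, 10.0])            # hypothetical local data
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]        # hypothetical 4-agent ring
x_est = edge_activated_consensus(a, ring)
```

For this quadratic model the minimizer of the sum of the f_i is the average of the a_i, so every entry of `x_est` should approach 4.0. Only the dual variables persist between activations, matching the description above.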









